<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The Silicon Corner]]></title><description><![CDATA[The Silicon Corner]]></description><link>https://blog.pol.company</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 02:50:11 GMT</lastBuildDate><atom:link href="https://blog.pol.company/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Decision trees uncovered]]></title><description><![CDATA[If you are a computer scientist I am sure you agree with me when I say that trees are everywhere. And I mean everywhere! It is extremely common to use trees as a basic data structure to improve and define new algorithms in all sorts of domains. Machi...]]></description><link>https://blog.pol.company/decision-trees-uncovered</link><guid isPermaLink="true">https://blog.pol.company/decision-trees-uncovered</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Decision trees]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Thu, 09 Oct 2025 06:30:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/GEyXGTY2e9w/upload/87408686a45401e0fed31f15209fa800.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you are a computer scientist, I am sure you will agree with me when I say that trees are everywhere. And I mean everywhere! It is extremely common to use trees as a basic data structure to improve and define new algorithms in all sorts of domains. Machine learning is no different: decision trees are among the most widely used nonparametric methods, and they can be used for both classification and regression.</p>
<p>Decision trees are hierarchical models that work by splitting the input space into smaller regions. A tree is composed of <strong>internal decision nodes</strong> and <strong>terminal leaves</strong>. Each internal decision node implements a <strong>test function</strong>: given a set of input variables, it returns a discrete outcome that selects which child node to visit next (the most common approach is the <strong>univariate tree</strong>, which tests only one variable at each node). Terminal leaves correspond to predictions: in classification the output is a class label, and in regression it is a specific numerical value. A great advantage of decision trees is that they can work with categorical values directly.</p>
<p>For example, in the following tree, we might want to classify patients that require treatment versus patients that do not. Each node makes a decision based on a simple rule, and each terminal node holds the final prediction. The gini index is a measure of how impure a node is; an impurity of 0.0 means the node has reached <strong>maximum purity</strong> and there is nothing left to split.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5safv416ocbxb77a5d68.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5safv416ocbxb77a5d68.png" alt="Alt Text" /></a></p>
<p>One of the perks of decision trees, compared to other machine learning algorithms, is that they are extremely easy to understand and have a high degree of interpretability. Just by reading the tree, you can make decisions yourself. On the other hand, decision trees are very sensitive to small variations in the training data, so it is usually recommended to apply an ensemble method such as the ones described below.</p>
<p><em>Note: In fact, a decision tree can be transformed into a series of rules that can then be used in a rule-based language such as Prolog.</em></p>
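<p><em>As a quick illustration of that note (a hedged sketch using scikit-learn, which this post does not otherwise use), a fitted tree can be exported as explicit if/else rules:</em></p>

```python
# Hypothetical example: scikit-learn can print a fitted decision tree
# as a set of human-readable rules, one per root-to-leaf path.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each nested if/else in the output corresponds to one internal test node.
print(export_text(tree, feature_names=load_iris().feature_names))
```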
<h1 id="heading-dimensionality-reduction">Dimensionality reduction</h1>
<p>The job of classification and regression trees (<strong>CART</strong>) is to predict an output based on the variables the input might have; splits <strong>near the root</strong> tend to involve the more important features, while splits <strong>near the leaves</strong> tend to correspond to less important ones. That is why decision trees are commonly used as a dimensionality reduction technique: by running the CART algorithm you get the importance of each feature for free!</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fig3jqq7bb9219t09qk1y.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fig3jqq7bb9219t09qk1y.png" alt="Alt Text" /></a></p>
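<p><em>A minimal sketch of this idea, assuming scikit-learn (my choice of library, not the post's): after fitting, each feature's importance is available directly.</em></p>

```python
# Illustrative only: per-feature importances from a fitted CART model.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Importances sum to 1; a higher value means the feature drove more
# impactful splits, which is what makes trees useful for feature selection.
for name, score in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {score:.3f}")
```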
<h1 id="heading-error-measures">Error measures</h1>
<p>As with any machine learning model, we must choose a suitable error function. In practice, any of the following error functions tends to perform well:</p>
<ul>
<li><p><strong>MSE</strong> (regression): one of the most common error functions in machine learning.</p>
</li>
<li><p><strong>Entropy</strong> (classification): entropy works by measuring the number of bits needed to encode a class code, based on its probability of occurrence.</p>
</li>
<li><p><strong>Gini index</strong> (classification): Slightly faster impurity measure than entropy, it tends to isolate the most frequent class in its own branch of the tree, while entropy produces more balanced branches.</p>
</li>
</ul>
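<p><em>The two classification measures above can be written directly from their definitions; this is an illustrative sketch, not code from the post.</em></p>

```python
import math

def entropy(p):
    """Bits needed on average to encode a class drawn from distribution p."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity of a class-probability distribution p."""
    return 1.0 - sum(pi * pi for pi in p)

# A pure node has zero impurity; a 50/50 node is maximally impure.
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # 0.0 0.0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5
```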
<p><em>Note: Error functions on classification trees are also called impurity measures.</em></p>
<h1 id="heading-boosting-trees">Boosting trees</h1>
<p>Decision trees are very good estimators, but sometimes they can perform poorly. Fortunately, there are many ensemble methods to boost their performance.</p>
<ul>
<li><p><strong>Random forests</strong>: a bagging/pasting method that trains multiple decision trees, each on a random subset of the training data (and typically a random subset of the features). Each tree makes a prediction, and the predictions are aggregated into a single final one.</p>
</li>
<li><p><strong>AdaBoost</strong>: a first base classifier is trained and used to make predictions. Then a second classifier is trained with more weight given to the instances the first one got wrong. This repeats until the desired number of classifiers has been trained.</p>
</li>
<li><p><strong>Stacking</strong>: instead of a simple voting mechanism between different classifiers, a blending classifier is trained on the predictions of the other classifiers rather than on the data directly.</p>
</li>
</ul>
<p><em>The following image represents a stacking ensemble:</em></p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw9zysh39d8gckg2bfbif.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw9zysh39d8gckg2bfbif.png" alt="Alt Text" /></a></p>
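<p><em>All three ensembles are available off the shelf; the sketch below assumes scikit-learn and a toy dataset, purely for illustration.</em></p>

```python
# Hypothetical example of the three ensemble methods described above.
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random forest: many trees, each on a random subset, then aggregated.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# AdaBoost: each new tree focuses on the examples the previous ones missed.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
# Stacking: a blender (here logistic regression) is trained on the
# predictions of the base classifiers instead of on the data directly.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("forest", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)

for model in (forest, boost, stack):
    print(type(model).__name__, model.fit(X, y).score(X, y))
```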
]]></content:encoded></item><item><title><![CDATA[Optimal neural networks]]></title><description><![CDATA[Like everything in this world, finding the right path to a high-end goal can become tedious if you don't have the right tools. Each objective and environment has different requirements and must be treated differently. An example of this might be trav...]]></description><link>https://blog.pol.company/optimal-neural-networks</link><guid isPermaLink="true">https://blog.pol.company/optimal-neural-networks</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[pytorch]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:56:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759960481621/ff244fce-ad03-47b1-a5ae-5e84888567b8.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Like everything in this world, finding the right path to a high-end goal can become tedious if you don't have the right tools. Each objective and environment has different requirements and must be treated differently. Take traveling as an example: using a car to go to the grocery shop might be the fastest and most comfortable way to get there. On the other hand, if we want to travel abroad, it might be a better idea to get on an airplane (unless you are one of those who love driving for hours).</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7emue40akmlpyeqe2c52.jpg"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7emue40akmlpyeqe2c52.jpg" alt="Alt Text" /></a></p>
<p>But we are not here to talk about the different types of transportation; we are here to talk about how to improve the training of your neural networks by choosing the best optimizer based on the memory it uses, its complexity, and its speed.</p>
<h1 id="heading-different-optimizers">Different optimizers</h1>
<p>Training a deep neural network can be very slow, but there are multiple ways to improve the speed of convergence. By improving the learning rules of the optimizer we can make the network learn faster (at some computational and memory cost).</p>
<h3 id="heading-simple-optimizer-sgd">Simple optimizer SGD</h3>
<p>The simplest optimizer out there is Stochastic Gradient Descent (SGD). It works by calculating the error and its gradient through backpropagation, then updating the corresponding weights, scaled by the learning rate.</p>
<p><strong>Speed</strong>: because it is the most basic implementation it is the fastest</p>
<p><strong>Memory</strong>: it is also the one that uses the least memory, since it only needs to store the gradient of each weight for backpropagation.</p>
<p><strong>Performance</strong>: it has a very slow convergence but generalizes better than most methods.</p>
<p><strong>Usage</strong>: this optimizer can be used in PyTorch by providing the model's parameters (weights) and the learning rate; the rest of the parameters are optional.  </p>
<pre><code class="lang-python">torch.optim.SGD(params, lr=&lt;required parameter&gt;, momentum=<span class="hljs-number">0</span>, dampening=<span class="hljs-number">0</span>, weight_decay=<span class="hljs-number">0</span>, nesterov=<span class="hljs-literal">False</span>)
</code></pre>
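<p><em>Under the hood, the update rule is just "weight minus learning rate times gradient". A pure-Python sketch on a toy one-dimensional problem (illustrative, not PyTorch internals):</em></p>

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent steps.
lr = 0.1
w = 0.0
for _ in range(100):
    grad = 2 * (w - 3)  # derivative of (w - 3)^2
    w -= lr * grad      # the learning-rate-scaled update
print(w)  # converges toward the minimum at w = 3
```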
<h3 id="heading-momentum-optimization">Momentum optimization</h3>
<p>Momentum optimization is a variant of SGD that incorporates the previous update into the current one, as if the weights had <strong>momentum</strong>. This provides a smoothing effect on training. The value of the momentum is usually between 0.5 and 1.0.</p>
<p><strong>Speed</strong>: very fast since it only has an additional multiplication.</p>
<p><strong>Memory</strong>: this optimization requires extra memory, since it needs to store the previous update for each weight.</p>
<p><strong>Performance</strong>: very useful since it provides an averaging and smooth effect in the trajectory during convergence. It promotes a faster convergence and helps roll past local optima. It almost always goes faster than SGD.</p>
<p><strong>Usage</strong>: to activate the momentum, you need to specify its value through the momentum parameter.  </p>
<pre><code class="lang-python">torch.optim.SGD(params, lr=&lt;required parameter&gt;, momentum&gt;<span class="hljs-number">0</span>, dampening=<span class="hljs-number">0</span>, weight_decay=<span class="hljs-number">0</span>, nesterov=<span class="hljs-literal">False</span>)
</code></pre>
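<p><em>The extra state is a single velocity per weight. A pure-Python sketch of the rule (illustrative values):</em></p>

```python
# Momentum keeps a velocity v that blends in the previous update.
lr, momentum = 0.1, 0.9
w, v = 0.0, 0.0            # v is the extra memory this method needs
for _ in range(300):
    grad = 2 * (w - 3)     # gradient of (w - 3)^2
    v = momentum * v - lr * grad
    w += v                 # the smoothed, "rolling" update
print(w)
```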
<h3 id="heading-nesterov-accelerated-gradient">Nesterov accelerated gradient</h3>
<p>A variant of momentum optimization was proposed in which, instead of measuring the gradient at the local position, we measure it a step ahead in the direction of the momentum.</p>
<p><strong>Speed</strong>: an additional sum must be done to apply the momentum to the parameter.</p>
<p><strong>Memory</strong>: no extra memory is used in this case.</p>
<p><strong>Performance</strong>: it usually works better than simple momentum since the momentum vector points towards the optimum. In general, it converges faster than the original momentum since we are promoting the movement towards a specific direction.</p>
<p><strong>Usage</strong>: to apply the use of Nesterov we must set the Nesterov flag to true and add some momentum to the optimizer.  </p>
<pre><code class="lang-python">torch.optim.SGD(params, lr=&lt;required parameter&gt;, momentum&gt;<span class="hljs-number">0</span>, dampening=<span class="hljs-number">0</span>, weight_decay=<span class="hljs-number">0</span>, nesterov=<span class="hljs-literal">True</span>)
</code></pre>
<h3 id="heading-adagrad">Adagrad</h3>
<p>Adagrad stands for <strong>adaptive gradient</strong>: it works by adapting the learning rate depending on where we are located. When we are near a local minimum, Adagrad adjusts the learning rate to move faster in that direction. A benefit of using this optimizer is that we don't need to concern ourselves too much with tuning the learning rate manually. The learning rate adapts based on all the gradients seen so far during training.</p>
<p><strong>Speed</strong>: it is slower, since every update must scale each gradient by the accumulated statistics.</p>
<p><strong>Memory</strong>: it requires additional memory to store the accumulated squared gradient of each parameter.</p>
<p><strong>Performance</strong>: in general, it works well for simple quadratic problems, but it often stops too early when training neural networks, since the learning rate gets scaled down too much, thus never getting to the minimum. It is not recommended for neural networks, but it may be efficient for simpler problems.</p>
<p><strong>Usage</strong>: Adagrad can be used by providing the default parameters.  </p>
<pre><code class="lang-python">torch.optim.Adagrad(params, lr=<span class="hljs-number">0.01</span>, lr_decay=<span class="hljs-number">0</span>, weight_decay=<span class="hljs-number">0</span>, initial_accumulator_value=<span class="hljs-number">0</span>, eps=<span class="hljs-number">1e-10</span>)
</code></pre>
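<p><em>A pure-Python sketch of the rule (illustrative values): each step is divided by the square root of all squared gradients accumulated so far, which is why the effective learning rate only ever shrinks.</em></p>

```python
import math

lr, eps = 0.5, 1e-10
w, acc = 0.0, 0.0                            # acc is extra per-weight state
for _ in range(500):
    grad = 2 * (w - 3)
    acc += grad * grad                       # grows for the whole run...
    w -= lr * grad / (math.sqrt(acc) + eps)  # ...so steps keep shrinking
print(w)
```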
<h3 id="heading-rmsprop">RMSprop</h3>
<p>This is a variant of the Adagrad algorithm that fixes its premature-stopping issue. It does so by accumulating only the gradients from the most recent iterations.</p>
<p><strong>Speed</strong>: it is very similar to Adagrad</p>
<p><strong>Memory</strong>: it uses the same memory as Adagrad</p>
<p><strong>Performance</strong>: it converges much faster than Adagrad and does not stall before reaching a local minimum. It was used by machine learning researchers for a long time before Adam came out. It does not perform very well on very simple problems.</p>
<p><strong>Usage</strong>: you might notice there is a new hyperparameter (alpha), but the default values usually work well; this technique can also be combined with momentum.  </p>
<pre><code class="lang-python">torch.optim.RMSprop(params, lr=<span class="hljs-number">0.01</span>, alpha=<span class="hljs-number">0.99</span>, eps=<span class="hljs-number">1e-08</span>, weight_decay=<span class="hljs-number">0</span>, momentum=<span class="hljs-number">0</span>, centered=<span class="hljs-literal">False</span>)
</code></pre>
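<p><em>The only change from Adagrad is that the accumulator becomes an exponential moving average, so old gradients fade away instead of piling up. A pure-Python sketch (illustrative values):</em></p>

```python
import math

lr, alpha, eps = 0.01, 0.9, 1e-8
w, acc = 0.0, 0.0
for _ in range(2000):
    grad = 2 * (w - 3)
    acc = alpha * acc + (1 - alpha) * grad * grad  # recent steps dominate
    w -= lr * grad / (math.sqrt(acc) + eps)
print(w)  # hovers near the minimum at w = 3
```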
<h3 id="heading-adam">Adam</h3>
<p>Adam is a relatively recent gradient descent optimization method; its name stands for adaptive moment estimation. It is a mix between momentum optimization and RMSprop.</p>
<p><strong>Speed</strong>: the most expensive per step, since it combines two methods.</p>
<p><strong>Memory</strong>: the same as RMSprop</p>
<p><strong>Performance</strong>: it usually performs better than RMSprop, since it is a combination of techniques that converges faster on the training data.</p>
<p><strong>Usage</strong>: Adam works well with the default parameters; it is even recommended to leave the learning rate as it is, since the method adapts it automatically.  </p>
<pre><code class="lang-python">torch.optim.Adam(params, lr=<span class="hljs-number">0.001</span>, betas=(<span class="hljs-number">0.9</span>, <span class="hljs-number">0.999</span>), eps=<span class="hljs-number">1e-08</span>, weight_decay=<span class="hljs-number">0</span>, amsgrad=<span class="hljs-literal">False</span>)
</code></pre>
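<p><em>A pure-Python sketch of the two combined ingredients (illustrative values): a running mean of gradients (the momentum part) and a running mean of squared gradients (the RMSprop part), each bias-corrected.</em></p>

```python
import math

lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    grad = 2 * (w - 3)
    m = b1 * m + (1 - b1) * grad          # first moment  (momentum)
    v = b2 * v + (1 - b2) * grad * grad   # second moment (RMSprop)
    m_hat = m / (1 - b1 ** t)             # bias corrections compensate
    v_hat = v / (1 - b2 ** t)             # for the zero initialization
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
print(w)
```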
<h1 id="heading-summary">Summary</h1>
<p>In the end, which optimization algorithm should you use? It depends. Adaptive algorithms are very popular nowadays, but they require more <strong>computational power</strong> and, most of the time, more <strong>memory</strong>. It has been observed that plain SGD often gets better results on the validation set, as it tends to generalize better; adaptive algorithms seem to optimize the training set too much, ending up with high variance and overfitting the data. The problem with SGD is that it might take a long time to reach a minimum, so the total computational resources needed can be much higher than those of adaptive optimizers. So, in the end, if you have plenty of computing resources you should consider using SGD with momentum, as it tends to generalize better. On the other hand, if your resources, especially <strong>time</strong>, are limited, Adam is your best choice.</p>
]]></content:encoded></item><item><title><![CDATA[The Perfect Activation]]></title><description><![CDATA[It might be too bold to call an activation function perfect, given that the No Free Lunch Theorem of machine learning states that there is no universally perfect machine learning algorithm. Nevertheless, as misleading as the title can be, I will try ...]]></description><link>https://blog.pol.company/the-perfect-activation</link><guid isPermaLink="true">https://blog.pol.company/the-perfect-activation</guid><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:41:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759959690208/a3bebeef-baa5-4b24-ada6-5895c2a39cab.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It might be too bold to call an activation function perfect, given that the <strong>No Free Lunch Theorem</strong> of machine learning states that there is no universally perfect machine learning algorithm. Nevertheless, as misleading as the title can be, I will try to summarize the most widely used activation functions and describe their main differences.</p>
<h1 id="heading-linear-identity">Linear (identity)</h1>
<p>The linear activation function is essentially no activation at all.<br /><strong>Overhead:</strong> fastest, no computation at all<br /><strong>Performance:</strong> bad, since it does not enable a non-linear transformation<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Differentiable at all points</p>
</li>
<li><p>Fast execution</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>Does not provide any non-linear output.</li>
</ul>
<h1 id="heading-sigmoid">Sigmoid</h1>
<p>The Sigmoid activation function is one of the oldest. Initially designed to mimic the activations of neurons in the brain, it has been shown to perform poorly in the hidden layers of artificial neural networks; nevertheless, it is commonly used as a classifier output to transform raw outputs into class probabilities.  </p>
<p><strong>Uses:</strong> it is commonly used in the output layer of binary classification where we need a probability value between 0 and 1.<br /><strong>Overhead:</strong> very expensive because of the exponential term.<br /><strong>Performance:</strong> bad on hidden layers, mostly used on output layers<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Outputs are between 0 and 1, that means that values won't explode.</p>
</li>
<li><p>It is differentiable at every point.</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li><p>Outputs are between 0 and 1, that means outputs might saturate.</p>
</li>
<li><p>Vanishing gradients are possible.</p>
</li>
<li><p>Outputs are always positive (zero-centered functions help achieve faster convergence).</p>
</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.Sigmoid() 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.sigmoid()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1mt954pdqqsoha16ear1.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1mt954pdqqsoha16ear1.png" alt="Alt Text" /></a></p>
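<p><em>A tiny numeric sketch (illustrative) of the saturation issue mentioned above: away from zero the output flattens, which starves the gradients.</em></p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))    # 0.5, the midpoint
print(sigmoid(10.0))   # very close to 1: the function has saturated
print(sigmoid(-10.0))  # very close to 0, and always positive
```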
<h1 id="heading-softmax">Softmax</h1>
<p>A generalization of the Sigmoid function to more than two classes: it transforms the outputs into a probability distribution over multiple classes. Used in multiclass classification.<br /><strong>Uses:</strong> used in the output layer of a multiclass neural network.<br /><strong>Overhead:</strong> similar to Sigmoid, but higher because there are more inputs.<br /><strong>Performance:</strong> bad on hidden layers, mostly used on output layers<br /><strong>Advantages:</strong></p>
<ul>
<li>Unlike Sigmoid, it ensures that the outputs are normalized: they sum to 1 across classes.</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>Same as Sigmoid.</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.Softmax(dim=...) 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.softmax()
</code></pre>
<h1 id="heading-hyperbolic-tangent">Hyperbolic Tangent</h1>
<p>The Tanh function has the same shape as the Sigmoid; in fact, it is a scaled and shifted version of it, and it works better in most cases.<br /><strong>Uses:</strong> generally used in hidden layers, as it outputs values between -1 and 1, producing normalized outputs and making learning faster.<br /><strong>Overhead:</strong> very expensive, since it uses exponential terms.<br /><strong>Performance:</strong> similar to Sigmoid but with some added benefits<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Outputs are between -1 and 1, that means that values won't explode.</p>
</li>
<li><p>It is differentiable at every point.</p>
</li>
<li><p>It is zero-centered, unlike Sigmoid.</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li><p>Vanishing gradients.</p>
</li>
<li><p>Gradient saturation.</p>
</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.Tanh() 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.tanh()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuz2khss64owahi5vkohr.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuz2khss64owahi5vkohr.png" alt="Alt Text" /></a></p>
<h1 id="heading-relu">ReLU</h1>
<p>ReLU, also called the rectified linear unit, is one of the most commonly used activations, both for its computational efficiency and its great performance. Multiple variations have been created to improve on its flaws.<br /><strong>Uses:</strong> best used in hidden layers, as it provides better performance than tanh and Sigmoid and is computationally faster.<br /><strong>Overhead:</strong> almost none, extremely fast.<br /><strong>Performance:</strong> great performance, recommended for most cases.<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Adds non-linearity to the network.</p>
</li>
<li><p>Does not suffer from vanishing gradients (for positive inputs).</p>
</li>
<li><p>Does not saturate for positive inputs.</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li><p>It suffers from dying ReLU</p>
</li>
<li><p>Not differentiable at x = 0</p>
</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.ReLU() 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.relu()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnwwr2cs5ftohlpfua94.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnwwr2cs5ftohlpfua94.png" alt="Alt Text" /></a></p>
<h1 id="heading-leaky-relu">Leaky Relu</h1>
<p>ReLU suffers from the dying ReLU problem, where neurons whose inputs are always negative output 0 and stop updating. Leaky ReLU tries to diminish this problem by replacing the 0 output with a small negative slope.<br /><strong>Uses:</strong> used in hidden layers.<br /><strong>Overhead:</strong> same as ReLU<br /><strong>Performance:</strong> great performance if the hyperparameter is chosen correctly<br /><strong>Advantages:</strong></p>
<ul>
<li>Similar to ReLU and fixes dying ReLU.</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>New hyperparameter to tune.</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.LeakyReLU(negative_slope=...) 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.layers.LeakyReLU(alpha=...)
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png" alt="Alt Text" /></a></p>
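<p><em>A pure-Python sketch (illustrative) of the difference: for negative inputs ReLU outputs exactly 0, while Leaky ReLU keeps a small slope so the neuron can still learn.</em></p>

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    return x if x > 0 else negative_slope * x

print(relu(-2.0), leaky_relu(-2.0))  # 0.0 -0.02
print(relu(3.0), leaky_relu(3.0))    # 3.0 3.0 (identical for positives)
```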
<h1 id="heading-parametric-relu">Parametric ReLU</h1>
<p>Takes the same idea as Leaky ReLU, but instead of predefining the leaky hyperparameter, it is added as a parameter that must be learned.<br /><strong>Uses:</strong> used in hidden layers.<br /><strong>Overhead:</strong> a new parameter must be learned for each PReLU in the network.<br /><strong>Performance:</strong> similar to Leaky ReLU, with the slope learned from the data<br /><strong>Advantages:</strong></p>
<ul>
<li>Removes the need to tune a hyperparameter manually</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>The learned parameter is not guaranteed to be optimal, and it increases the overhead, so you might as well try a few values yourself with Leaky ReLU.</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.PReLU(num_parameters=<span class="hljs-number">1</span>) 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.layers.PReLU()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnsb3l2b2pyntx7k1pcv8.png" alt="Alt Text" /></a></p>
<h1 id="heading-elu">ELU</h1>
<p>The ELU (exponential linear unit) was introduced as another alternative to fix the issues you can encounter with ReLU.<br /><strong>Uses:</strong> used in hidden layers<br /><strong>Overhead:</strong> computationally expensive, since it uses an exponential term<br /><strong>Performance:</strong> often competitive with ReLU in hidden layers, at a higher computational cost<br /><strong>Advantages:</strong></p>
<ul>
<li><p>Similar to ReLU.</p>
</li>
<li><p>Produces negative outputs.</p>
</li>
<li><p>Bends smoothly, unlike Leaky ReLU.</p>
</li>
<li><p>Differentiable at x = 0</p>
</li>
</ul>
<p><strong>Common issues:</strong></p>
<ul>
<li>Additional hyperparameter</li>
</ul>
<p><strong>Code:</strong>  </p>
<pre><code class="lang-python"><span class="hljs-comment"># Pytorch </span>
torch.nn.ELU() 
<span class="hljs-comment"># Tensorflow </span>
tf.keras.activations.elu()
</code></pre>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgc9skcsszpl77o5oxqhm.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgc9skcsszpl77o5oxqhm.png" alt="Alt Text" /></a></p>
<h1 id="heading-other-alternatives">Other alternatives</h1>
<p>There are too many activation functions to cover them all in a single post. Here are some others:</p>
<ul>
<li><p>SeLU</p>
</li>
<li><p>GeLU</p>
</li>
<li><p>CeLU</p>
</li>
<li><p>Swish</p>
</li>
<li><p>Mish</p>
</li>
<li><p>Softplus</p>
</li>
</ul>
<p><em>Note: if it ends with LU it usually comes from ReLU.</em></p>
<h1 id="heading-summary">Summary</h1>
<p>So... having so many choices, which activation should we use? As a <strong>rule of thumb</strong>, you should always try ReLU in the hidden layers first, as it has great performance with minimal computational overhead. After that (if you have enough computing power) you might want to try some of the more complex variations of ReLU or similar alternatives. I would never recommend using Sigmoid, Tanh, or Softmax for any hidden layer; Sigmoid and Softmax should be used whenever we want probability outputs for a classification task. Finally, with the current progress and research in deep learning and AI, new and better functions will surely appear, so keep an eye out.</p>
<p>Remember to <strong>try and experiment always</strong>, you never know which function will work better for a specific task.</p>
]]></content:encoded></item><item><title><![CDATA[How to make your code embarrassingly faster?]]></title><description><![CDATA[Amdahl’s law is most popular in the computer science community, named after Gene Amdahl in 1967. It is said that the more resources we add to a program the faster it goes, but how can we add more resources to the program? Imagine a program as a singl...]]></description><link>https://blog.pol.company/how-to-make-your-code-embarrassingly-faster</link><guid isPermaLink="true">https://blog.pol.company/how-to-make-your-code-embarrassingly-faster</guid><category><![CDATA[C++]]></category><category><![CDATA[openmp]]></category><category><![CDATA[Parallel Programming]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:39:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/plwud_FPvwU/upload/269dc9a37babccdd1fd7d2bc6b9d61bf.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Amdahl’s law, named after Gene Amdahl, who formulated it in 1967, is well known in the computer science community. It says that the more resources we give a program, the faster it can go; but how can we add more resources to a program? Imagine a program as a single task: if we could divide it into smaller tasks, and those into smaller tasks still, we would end up with a program made of many tasks; if every task is independent of the others, we could assign each one to a different processor. Dividing work among multiple processors is called <strong>parallel computing</strong>. The size of the tasks relative to their number is called <strong>task granularity</strong>: the smaller the tasks, the finer the granularity.</p>
<p>It is easy to see that if we run many tasks concurrently we can save time, since we can do many things at once. Awesome, right? Well… not always. First of all, creating a new task has an overhead we need to take into account, and tasks may depend on one another (e.g. task A needs to finish before task B starts), so we might not be able to parallelize everything. Finally, our computer can only handle a limited number of concurrent tasks at the same time. So, if you still think parallel computing is the best thing that ever happened, continue reading and I will show you how to make your code embarrassingly parallel (despite the issues above).</p>
<h1 id="heading-openmp-introduction">OpenMP Introduction</h1>
<p>Now, there are multiple ways to parallelize your code, but I will focus on the <strong>OpenMP API</strong> because it is easy to learn and implement. Let’s look at a simple example of how you can parallelize the sum of two vectors.  </p>
<pre><code class="lang-cpp"><span class="hljs-function"><span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-title">sum</span><span class="hljs-params">(<span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-keyword">const</span>&amp; v1, <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-keyword">const</span>&amp; v2)</span></span>{

    <span class="hljs-keyword">int</span> size = v1.size(); <span class="hljs-comment">// could have also done v2.size()</span>
    <span class="hljs-function"><span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-title">output</span><span class="hljs-params">(size)</span></span>; <span class="hljs-comment">// initialize output with zeros </span>

    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp parallel </span>
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp for </span>
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; size; ++i){
      output[i] = v1[i] + v2[i];
    }

    <span class="hljs-keyword">return</span> output;

}
</code></pre>
<p>Summing two vectors is simple: you just sum each pair of elements individually and write the result into another vector. But what are those <code>#pragma omp (something)</code> lines? Those <code>#pragma omp</code> directives are what let us parallelize the code.</p>
<ul>
<li><p><code>#pragma omp parallel</code> creates a parallel region, which all available threads execute.</p>
</li>
<li><p><code>#pragma omp for</code> divides the loop into k chunks, and each thread executes a different chunk (task). This way each thread sums a different part of the vector. The sum of two vectors is what parallel computing calls <strong>embarrassingly</strong> parallel: we can make the tasks as small as we want and they won’t have any dependencies (the sum of each pair of elements is <strong>independent</strong> of the others). What consequences does this have? We can parallelize as much as the hardware supports, and we don’t have to worry about data being shared between threads.</p>
</li>
</ul>
<h1 id="heading-how-to-avoid-data-races">How to avoid data races</h1>
<p>Imagine that instead of summing two vectors we want the sum of the elements of a single one. It is tempting to use the same OpenMP structure as before, but that would cause a <strong>data race</strong>. A data race happens when two threads access the same data at the same time; it causes <strong>inconsistency</strong> in the results, since the updates can interleave in a different order on every run. To visualize this, let's see how we could solve the issue.  </p>
<pre><code class="lang-cpp">    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; v; 
    ...
    <span class="hljs-keyword">int</span> sum = <span class="hljs-number">0</span>;
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp parallel </span>
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp for </span>
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; v.size(); ++i){
      <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp critical </span>
      sum += v[i];
    }
</code></pre>
<p>A simple way to do this is to add a <code>#pragma omp critical</code>; this directive ensures that only one thread at a time executes the code it protects, so when a thread wants to update the sum it has to wait for the other threads to finish doing so. A more efficient way (with less overhead) is to replace <code>critical</code> with <code>atomic</code>. Either way, if you try to execute this code you’ll see that it does not go as fast as you would think, because most of the time the threads are just waiting. I wish there were a way to make it faster…  </p>
<pre><code class="lang-cpp">    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; v; 
    ...
    <span class="hljs-keyword">int</span> sum = <span class="hljs-number">0</span>;
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp parallel </span>
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp for reduction(+:sum)</span>
    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; v.size(); ++i){
      sum += v[i];
    }
</code></pre>
<p>In fact, there is: the <strong>reduction</strong> clause solves all our problems. When a thread is created, a private copy of <code>sum</code> is created and initialized to 0. Each thread updates its own copy, and in the end the copies are summed together. By now you may be wondering how to make a variable explicitly <strong>private</strong> or <strong>shared</strong>: on the <code>#pragma omp parallel</code> (or combined <code>parallel for</code>) directive you can add a <code>shared(var)</code> clause or a <code>private(var)</code> clause.</p>
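<p>The <code>atomic</code> alternative mentioned above is worth seeing too. A hedged sketch (the function name <code>sum_atomic</code> is mine; compile with <code>-fopenmp</code> to enable the pragmas, otherwise the code simply runs serially and still returns the correct result):</p>

```cpp
#include <vector>

// Same sum as before, but protected with "atomic" instead of "critical":
// atomic maps the update to a hardware atomic instruction, so it carries
// less overhead than a full critical section
int sum_atomic(std::vector<int> const& v) {
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        #pragma omp atomic
        sum += v[i];
    }
    return sum;
}
```

<p>Note that <code>atomic</code> only works for simple updates like <code>sum += v[i]</code>; for arbitrary blocks of code you still need <code>critical</code>, and for this particular pattern <code>reduction</code> remains the fastest option.</p>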
<h1 id="heading-controlling-the-number-of-tasks">Controlling the number of tasks</h1>
<p>Until now we have let OpenMP decide for us which tasks are generated, and thus the number of <strong>threads</strong>; but how can we control this?  </p>
<pre><code class="lang-cpp">
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"omp.h"</span> <span class="hljs-comment">// remember to include the omp directive </span></span>

<span class="hljs-comment">// equivalent to setting the env bash var </span>
<span class="hljs-comment">// OMP_NUM_THREADS=N</span>
omp_set_num_threads(N);

<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp parallel num_threads(N)</span>
</code></pre>
<p>The first way is to set the <strong>environment variable</strong> <code>OMP_NUM_THREADS</code>. A second method is to set it with an OpenMP function; this sets the number of threads for all parallel regions. If you want to set the number of <strong>threads</strong> for a specific region, you set it directly on that region with the <code>num_threads</code> clause. But this only controls the number of threads created; what if we want to control the number of tasks?  </p>
<pre><code class="lang-cpp">
<span class="hljs-comment">// sum a vector from index begin to index end and return the value </span>
<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">sum_vector</span><span class="hljs-params">(<span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;<span class="hljs-keyword">int</span>&gt; <span class="hljs-keyword">const</span>&amp; v, <span class="hljs-keyword">int</span> begin, <span class="hljs-keyword">int</span> end)</span></span>;

<span class="hljs-keyword">int</span> sum1, sum2; 

<span class="hljs-comment">// n is the size of the vector "v"</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp task </span>
sum1 = sum_vector(v, <span class="hljs-number">0</span>, n / <span class="hljs-number">2</span>); <span class="hljs-comment">// sum first half </span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp task </span>
sum2 = sum_vector(v, n / <span class="hljs-number">2</span>, n); <span class="hljs-comment">// sum second half </span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> omp taskwait </span>
<span class="hljs-keyword">int</span> sum = sum1 + sum2;
</code></pre>
<p>Based on the sum example, we can add up a vector by dividing the work into two tasks: the first sums the first half and the second sums the second half. That is what we are doing here when we create two tasks with the <strong>task pragma</strong>. Simple, right? You might have noticed something strange after the second task (<code>#pragma omp taskwait</code>): taskwait does what its name says, it waits for the current tasks to finish before continuing the execution (it works as a sort of explicit <strong>barrier</strong>). But why do we have to wait for them to finish? Because if we don’t, <code>sum1</code> and <code>sum2</code> might not yet be available for the final sum. Note also that for the tasks to actually run in parallel, this code must sit inside a parallel region, typically combined with <code>#pragma omp single</code> so that only one thread creates the tasks.</p>
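<p>A self-contained version of the task example, for reference. This is a sketch under my own assumptions: the body of <code>sum_vector</code> and the wrapper name <code>parallel_sum</code> are mine, and it needs <code>-fopenmp</code> to actually run in parallel (without it the pragmas are ignored and the code runs serially, still producing the right answer):</p>

```cpp
#include <numeric>
#include <vector>

// Sum v[begin, end) serially; each task calls this on one half
int sum_vector(std::vector<int> const& v, int begin, int end) {
    return std::accumulate(v.begin() + begin, v.begin() + end, 0);
}

int parallel_sum(std::vector<int> const& v) {
    int n = static_cast<int>(v.size());
    int sum1 = 0, sum2 = 0;
    #pragma omp parallel
    #pragma omp single        // one thread creates the tasks
    {
        #pragma omp task shared(sum1)
        sum1 = sum_vector(v, 0, n / 2);     // sum first half
        #pragma omp task shared(sum2)
        sum2 = sum_vector(v, n / 2, n);     // sum second half
        #pragma omp taskwait                // wait for both tasks
    }
    return sum1 + sum2;
}
```

<p>The <code>shared</code> clauses make explicit that both tasks write into variables visible outside the tasks.</p>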
<h1 id="heading-conclusion">Conclusion</h1>
<p>This has been a very short and limited introduction to OpenMP, since there are a lot of things I wish I could have covered, but parallel computing is a truly extensive topic. Nevertheless, I hope this introduction has opened your eyes to how you can improve your code with simple additions.</p>
]]></content:encoded></item><item><title><![CDATA[How to implement a simple lossless compression in C++]]></title><description><![CDATA[Compression algorithms are one of the most important computer science discoveries. It enables us to save data using less space and transfer it faster. Moreover, compression techniques are so enhanced that even lossy compressions give us an unnoticeab...]]></description><link>https://blog.pol.company/how-to-implement-a-simple-lossless-compression-in-c</link><guid isPermaLink="true">https://blog.pol.company/how-to-implement-a-simple-lossless-compression-in-c</guid><category><![CDATA[C++]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:33:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/-2np-ZIwMAA/upload/552414f83ae5b739b5adcff72439aafd.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Compression algorithms are one of the most important discoveries in computer science. They enable us to store data in less space and transfer it faster. Moreover, compression techniques have become so refined that even lossy compression can discard data with barely noticeable loss. Nevertheless, we are not going to talk about lossy compression algorithms but about lossless ones, in particular a very famous one called Huffman encoding. You may know it because it is used in JPEG image compression. In this post we will discover the magic behind this compression algorithm, going step by step until we end up with a very simple implementation in C++.</p>
<h1 id="heading-prefix-property">Prefix property</h1>
<p>Huffman encoding is a code system based on the prefix property. To encode a text, we first collect each distinct character of the text into a set of symbols. To compress each symbol we need a function that converts a character into a code (e.g. a binary string): given a set of symbols Σ we can define a function ϕ: Σ → {0,1}+ that maps each symbol to a code, where Σ contains the distinct characters of the text to be compressed. The simplest prefix encoding would be to assign each letter its binary number, a plain ASCII-to-binary conversion; this encoding essentially maps each character to itself, and it surely does not compress at all. Prefix codes are very easy to decode, since they only need to be read once, left-to-right; this guarantees a decompression runtime complexity of O(n). A common way to represent this type of encoding is a binary tree called the <strong>prefix tree</strong>.<br />For example, let's suppose we have the following set and encoding scheme.</p>
<ul>
<li><p><strong>Symbols:</strong> Σ = {A, B, C, D}</p>
</li>
<li><p><strong>Encoding:</strong> ϕ(A) = 1, ϕ(B)=01, ϕ(C)=000, ϕ(D)=001; then we can represent it with the following tree</p>
<p>  <img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxveg9vsu3tso8jxv7s8y.png" alt="Alt Text" /></p>
</li>
</ul>
<p>As we can see in the tree, to decode a text (e.g. 00010010…) we must <strong>traverse</strong> the tree until we find a leaf (where a character is stored). If the current bit is a 0 we go left, and if it is a 1 we go right. That simple!</p>
<p>After creating the tree, it is easier to save the equivalences (code, character) in a simple table.</p>
<p>A prefix tree has the following properties:</p>
<ul>
<li><p>One leaf per symbol</p>
</li>
<li><p>Left edge labeled 0 and right edge labeled 1</p>
</li>
<li><p>Labels on the path from the root to a leaf specify the code for that leaf.</p>
</li>
</ul>
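<p>The traversal described above can be sketched as a tiny decoder. For brevity this sketch represents the prefix tree as the code table just mentioned, and greedily matches bits against it; the function name <code>decode</code> is mine, not from the post. Because the code is prefix-free, the first match is always the right one:</p>

```cpp
#include <map>
#include <string>

// Decode a bit string using a (code -> character) table built from the tree
std::string decode(std::string const& bits,
                   std::map<std::string, char> const& table) {
    std::string out, current;
    for (char bit : bits) {
        current += bit;                // walk one edge: 0 = left, 1 = right
        auto it = table.find(current);
        if (it != table.end()) {       // reached a leaf: emit its symbol
            out += it->second;
            current.clear();           // restart at the root
        }
    }
    return out;
}
```

<p>With the example scheme above ({"1" → A, "01" → B, "000" → C, "001" → D}), the input "0001" decodes to "CA".</p>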
<h1 id="heading-encoding">Encoding</h1>
<p>Okay, so what do these strange prefix trees have to do with Huffman trees? Well, it turns out Huffman trees are prefix trees, but not just any prefix trees: they are the <strong>optimal prefix trees</strong>. Given a text, an optimal prefix code is a prefix code that minimizes the total number of bits needed to encode that text; in other words, it is the encoding that makes the text smallest (fewer bits = more compression). Note that if you use a Huffman tree to compress data, you should also save the tree with which it was encoded.</p>
<p>Now, how do we find this optimal tree? Well, we need to follow the following steps.</p>
<ol>
<li><p>Find the <strong>frequencies</strong> of each character and save them in a table</p>
</li>
<li><p>For each character, we create a prefix tree consisting of only the leaf node. This node should contain the value of the character and its frequency in the text.</p>
</li>
<li><p>We should have a list of trees now, one per character. Next, we are going to select the two <strong>smallest</strong> trees, we consider a tree to be smaller to another one if its frequency is lower (in case of a tie we select the one with fewer nodes), and we are going to <strong>merge</strong> them into one; that is one of the two should become the left subtree and one the right subtree, afterward, a new parent node is created.</p>
</li>
</ol>
<p>Well, that's it: after joining all the trees you should be left with only one. If you were paying attention, you must have noticed that I didn’t specify how to select the smallest trees from the list of all trees. That is because it depends on the implementation. The fast way is to keep the trees in a MinHeap (a priority queue in C++): each insertion and deletion in the heap has O(log n) complexity, but looking up the minimum is always constant. Thus the total complexity of the encoding algorithm is O(n log n), because we must insert a new tree n times.</p>
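<p>Step 1, counting the frequencies, can be sketched in a few lines (the helper name <code>count_frequencies</code> is an assumption of mine; this builds the table the leaf trees are created from):</p>

```cpp
#include <string>
#include <unordered_map>

// Count how many times each character appears in the text:
// this is the frequency table the initial leaf trees are built from
std::unordered_map<char, int> count_frequencies(std::string const& text) {
    std::unordered_map<char, int> table;
    for (char c : text) ++table[c];
    return table;
}
```
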
<h1 id="heading-implementation">Implementation</h1>
<p>The Huffman compression algorithm is a greedy algorithm; that is, it always makes the locally optimal choice. To implement it, we can create a class called HuffmanTree.</p>
<pre><code class="lang-cpp"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">HuffmanTree</span>{</span>
<span class="hljs-keyword">public</span>:

    HuffmanTree(<span class="hljs-keyword">char</span> v, <span class="hljs-keyword">int</span> w);

    HuffmanTree(HuffmanTree <span class="hljs-keyword">const</span>&amp; tree);

    HuffmanTree(HuffmanTree <span class="hljs-keyword">const</span>&amp; h1, HuffmanTree <span class="hljs-keyword">const</span>&amp; h2);

    <span class="hljs-keyword">bool</span> <span class="hljs-keyword">operator</span>&lt;(HuffmanTree <span class="hljs-keyword">const</span>&amp; other) <span class="hljs-keyword">const</span>;

<span class="hljs-keyword">private</span>:

    <span class="hljs-comment">// represents a value that will never be read;</span>
    <span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> NULL_VALUE = <span class="hljs-number">-1</span>;

    <span class="hljs-comment">// left subtree</span>
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">shared_ptr</span>&lt;HuffmanTree&gt; left;

    <span class="hljs-comment">// right subtree</span>
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">shared_ptr</span>&lt;HuffmanTree&gt; right;

    <span class="hljs-keyword">char</span> value; <span class="hljs-comment">// character, null if !isLeaf </span>
    <span class="hljs-keyword">int</span> weight; <span class="hljs-comment">// aka. frequency </span>
    <span class="hljs-keyword">int</span> size; <span class="hljs-comment">// aka. number of nodes </span>
    <span class="hljs-keyword">bool</span> isLeaf;
};
</code></pre>
<p>A HuffmanTree contains, as we said before, the <strong>value</strong> (character), its <strong>weight</strong> (frequency), and the <strong>size</strong> (number of nodes). Finally, it also holds pointers to the left and right subtrees; we use shared pointers to follow modern C++ <strong>smart pointer</strong> practice and avoid worrying about memory leaks.</p>
<p>You may be wondering why would we want to implement three different <strong>constructors</strong>? Well, the first one creates a new tree with a given value and weight.</p>
<pre><code class="lang-cpp">

HuffmanTree::HuffmanTree(<span class="hljs-keyword">char</span> v, <span class="hljs-keyword">int</span> w){
    value = v;
    left = <span class="hljs-literal">nullptr</span>;
    right = <span class="hljs-literal">nullptr</span>;
    weight = w;
    size = <span class="hljs-number">1</span>;
    isLeaf = <span class="hljs-literal">true</span>;
}
</code></pre>
<p>The second constructor is just a copy constructor, that creates a new one based on the old one.</p>
<pre><code class="lang-cpp">HuffmanTree::HuffmanTree(HuffmanTree <span class="hljs-keyword">const</span>&amp; tree){
    value = tree.value;
    left = tree.left;
    right = tree.right;
    weight = tree.weight;
    size = tree.size;
    isLeaf = tree.isLeaf;
}
</code></pre>
<p>Finally, we need a constructor that merges two different trees.</p>
<pre><code class="lang-cpp">HuffmanTree::HuffmanTree(HuffmanTree <span class="hljs-keyword">const</span>&amp; h1, HuffmanTree <span class="hljs-keyword">const</span>&amp; h2) {
    left = <span class="hljs-built_in">std</span>::make_shared&lt;HuffmanTree&gt;(h1);
    right =  <span class="hljs-built_in">std</span>::make_shared&lt;HuffmanTree&gt;(h2);
    size = left-&gt;size  + right-&gt;size;
    weight = left-&gt;weight + right-&gt;weight;
    isLeaf = <span class="hljs-literal">false</span>;
    value = NULL_VALUE;
}
</code></pre>
<p>The HuffmanTree class also overloads a comparison operator. Note that the comparison is deliberately inverted (the heavier tree compares as &quot;smaller&quot;): <code>std::priority_queue</code> is a max-heap by default, so inverting the comparison makes it pop the lightest tree first, which is exactly the MinHeap behavior we need.</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">bool</span> HuffmanTree::<span class="hljs-keyword">operator</span>&lt;(HuffmanTree <span class="hljs-keyword">const</span>&amp; other) <span class="hljs-keyword">const</span>{
    <span class="hljs-comment">// inverted on purpose: std::priority_queue pops the "largest" element</span>
    <span class="hljs-keyword">if</span>(weight != other.weight) <span class="hljs-keyword">return</span> weight &gt; other.weight;
    <span class="hljs-keyword">else</span> <span class="hljs-keyword">return</span> size &gt; other.size;
}
</code></pre>
<p>Finally, we need to implement the core of the algorithm. As you can see, we first create a HuffmanTree per character and then merge trees until we are left with only one.</p>
<pre><code class="lang-cpp">

    ...

    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">priority_queue</span>&lt;HuffmanTree&gt; minHeap;

    <span class="hljs-keyword">for</span>(<span class="hljs-keyword">auto</span> <span class="hljs-keyword">const</span>&amp; letter : table){
        minHeap.push(HuffmanTree(letter.first, letter.second)); <span class="hljs-comment">// first == char, second == frequency </span>
    }

    <span class="hljs-comment">// join trees</span>
    <span class="hljs-keyword">while</span>(minHeap.size() &gt; <span class="hljs-number">1</span>){
        HuffmanTree min1 = minHeap.top();
        minHeap.pop();
        HuffmanTree min2 = minHeap.top();
        minHeap.pop();
        minHeap.push(HuffmanTree(min1, min2));
    }
</code></pre>
<p>That’s all, you have successfully implemented a Huffman tree. I hope you haven’t gotten lost along the way!</p>
<p>If you have any doubts, please leave a comment.</p>
]]></content:encoded></item><item><title><![CDATA[When accuracy is not enough...]]></title><description><![CDATA[The task of classification has existed long before the invention of machine learning. A problem that may arise when working with different algorithms is the use of an error function that determines if an algorithm is good enough, with classification ...]]></description><link>https://blog.pol.company/when-accuracy-is-not-enough</link><guid isPermaLink="true">https://blog.pol.company/when-accuracy-is-not-enough</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Pol Monroig Company]]></dc:creator><pubDate>Wed, 08 Oct 2025 21:29:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759958896013/19c51889-5aea-4065-870d-411c0069a128.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The task of classification has existed long before the invention of machine learning. A problem that may arise when working with different algorithms is the use of an <strong>error function</strong> that determines if an algorithm is good enough, with classification algorithms it is no different.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Folg2547hcyx52meiff0l.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Folg2547hcyx52meiff0l.png" alt="Alt Text" /></a></p>
<p>One of the most used metrics for these algorithms is the <strong>accuracy metric</strong>: based on the total number of samples and the predictions made, we return the <strong>percentage of samples</strong> that were correctly classified. But this method does not always work so well. Imagine that we have a total of 1000 samples and an algorithm called <em>DummyAlgorithm</em> that tries to classify them into two different classes (A and B). Unfortunately, DummyAlgorithm does not know anything about the data distribution; as a result, it always tells us that a given sample is of type A. Now imagine that 990 of the samples are of class A (you might see where I'm going). Even though DummyAlgorithm reaches a 99% accuracy rate, it is not a very good algorithm: it has learned nothing about the data.</p>
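<p>The DummyAlgorithm thought experiment is easy to reproduce. A toy sketch (the function name <code>constant_accuracy</code> is mine) computing the accuracy of a classifier that always emits the same prediction:</p>

```cpp
#include <vector>

// Accuracy of a classifier that always predicts the same class,
// regardless of the input
double constant_accuracy(std::vector<char> const& labels, char prediction) {
    if (labels.empty()) return 0.0;
    int correct = 0;
    for (char label : labels) {
        if (label == prediction) ++correct;
    }
    return static_cast<double>(correct) / labels.size();
}
```

<p>On a dataset with 990 samples of class A and 10 of class B, always predicting A scores 99% accuracy while carrying zero information about the data.</p>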
<p>In this post we'll learn how to complement the accuracy metric with other machine learning strategies that do take the problem described above into account, and we'll see some methods to avoid it.</p>
<h1 id="heading-definitions">Definitions</h1>
<p>Before going any further, let's define some basic concepts.</p>
<p><strong>Accuracy:</strong> metric that returns the percentage of correctly classified samples in a dataset</p>
<p><strong>True Positives:</strong> samples that were correctly classified with their respective positive class</p>
<p><strong>True Negatives:</strong> samples that were correctly classified with their respective negative class</p>
<p><strong>False Positives:</strong> samples that were classified as positives but were negatives</p>
<p><strong>False Negatives:</strong> samples that were classified as negatives but were positives</p>
<p><strong>Precision:</strong> accuracy of the positive predictions (TP / (TP + FP))</p>
<p><strong>Recall:</strong> ratio of positive instances that are correctly classified (TP / (TP + FN))</p>
<p><em>Note: when we talk about positives/negatives, we are talking about a specific class</em></p>
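<p>The two formulas above translate directly into code (a minimal sketch, function names mine):</p>

```cpp
// Precision and recall from the confusion counts defined above
// (tp = true positives, fp = false positives, fn = false negatives)
double precision(int tp, int fp) {
    return static_cast<double>(tp) / (tp + fp);
}

double recall(int tp, int fn) {
    return static_cast<double>(tp) / (tp + fn);
}
```

<p>For instance, a classifier with 8 true positives and 2 false positives has a precision of 0.8.</p>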
<h1 id="heading-confusion-matrix">Confusion Matrix</h1>
<p>The confusion matrix creates a division for each of the four possible categorizations. It can be used in multiclass classification. In the following example we are making a binary classification that classifies red dots among other colors.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh7seoqf0t45tjtw12hmz.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh7seoqf0t45tjtw12hmz.png" alt="Alt Text" /></a></p>
<h1 id="heading-precision-vs-recall-tradeoff">Precision vs recall tradeoff</h1>
<p>As with other metrics, there is a decision to make: whether the classifier should favor better precision or better recall. <em>Sometimes you care more about precision than you care about recall</em>. For example, if you wish to detect safe-for-work videos on a social network, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision). On the other hand, suppose you train a classifier to detect shoplifters: it is probably better for the classifier to have as much recall as possible (the security system will get some false alerts, but almost all shoplifters will get caught).</p>
<p>Based on this tradeoff we can define a curve called the <strong>precision/recall curve</strong></p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft16pwipspe97wd6jei4t.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft16pwipspe97wd6jei4t.png" alt="Alt Text" /></a></p>
<h1 id="heading-roc-curve">ROC curve</h1>
<p>The ROC curve (receiver operating characteristic curve) is a very common tool used with binary classifiers. It is very similar to the precision/recall curve, but it plots the <strong>true positive rate</strong> against the <strong>false positive rate</strong>. One way to compare classifiers is to measure the <strong>area under the curve</strong> (AUC). A perfect classifier will have an AUC equal to 1, while a purely random classifier will have a ROC AUC of 0.5.</p>
<p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbo9nz5shg2bs9nttsogk.png"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbo9nz5shg2bs9nttsogk.png" alt="Alt Text" /></a></p>
<p>As the ROC curve and the precision/recall curve are very similar, it might be difficult to choose between them. A common approach is to use the precision/recall curve whenever the positive class is rare and when you care more about the false positives than the false negatives, and the ROC curve otherwise.</p>
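<p>The AUC has a handy probabilistic reading: it equals the chance that a randomly chosen positive sample is scored above a randomly chosen negative one. A brute-force sketch based on that reading (the function name <code>roc_auc</code> is mine; real libraries compute this far more efficiently from the sorted scores):</p>

```cpp
#include <vector>

// AUC via its probabilistic interpretation: the fraction of
// (positive, negative) pairs where the positive is scored higher
// (ties count as half; O(P*N), fine for an illustration)
double roc_auc(std::vector<double> const& pos_scores,
               std::vector<double> const& neg_scores) {
    double wins = 0.0;
    for (double p : pos_scores) {
        for (double n : neg_scores) {
            if (p > n) wins += 1.0;
            else if (p == n) wins += 0.5;
        }
    }
    return wins / (pos_scores.size() * neg_scores.size());
}
```

<p>A classifier that perfectly separates the classes scores 1.0; one that cannot tell them apart scores 0.5, matching the values quoted above.</p>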
<h1 id="heading-solutions">Solutions</h1>
<p>The accuracy problem essentially happens when the data the model is being tested with is unbalanced. To solve this issue there are several approaches.</p>
<ul>
<li><p>If you have a lot of training data you can discard some of it to create a more balanced dataset. Since your model might generalize worse with less data, this approach should only be used in special cases.</p>
</li>
<li><p>Use a data augmentation technique to increase the data available.</p>
</li>
<li><p>Use a resampling technique, in which you make the training data bigger by reusing the same data (e.g. oversampling the minority class); useful if the data augmentation approach is too complicated.</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>