
Linear regression is pretty restricted in terms of the kind of data it can model.

For instance, its assumed data generation process looks like this:

Firstly, it assumes that the conditional distribution of Y given X is a Gaussian.

Next, it assumes a very specific form for the mean of the above Gaussian. It says that the mean should always be a linear combination of the features (or predictors).

Lastly, it assumes a constant variance for the conditional distribution P(Y|X) across all levels of X. A graphical way of illustrating this is as follows:
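To make these three assumptions concrete, here is a minimal stdlib-only sketch (hypothetical intercept, slope, and noise scale) of the data-generating process linear regression assumes:

```python
import random

random.seed(0)

# Hypothetical parameters, for illustration only
b0, b1 = 2.0, 0.5   # intercept and slope
sigma = 1.0          # constant (homoscedastic) noise scale

# Linear regression's assumed data-generating process:
# Y | X ~ Normal(mean = b0 + b1*x, std = sigma), with the SAME sigma at every x
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [b0 + b1 * x + random.gauss(0, sigma) for x in xs]
```

Note how all three assumptions appear in one line: the Gaussian (`random.gauss`), the linear mean (`b0 + b1 * x`), and the constant variance (`sigma` does not depend on `x`).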

These conditions often restrict its applicability to data situations that do not obey the above assumptions.

In other words, nothing stops real-world datasets from violating these assumptions.

In fact, in many scenarios, the data might exhibit complex relationships, heteroscedasticity (varying variance), or even follow entirely different distributions altogether.

Yet, if we intend to build **linear models**, we should formulate better algorithms that can handle these peculiarities.

**Generalized linear models (GLMs) precisely do that.**

They relax the assumptions of linear regression to make linear models more adaptable to real-world datasets.

More specifically, they consider the following:

What if the distribution isn’t normal but some other distribution?

What if X has a more sophisticated relationship with the mean?

What if the variance varies with X?

The effectiveness of a specific GLM, **Poisson regression**, over linear regression is evident from the image below:

Linear regression assumes the data is drawn from a Gaussian, when in reality, it isn’t. Hence, it underperforms.

Poisson regression adapts its regression fit to a non-Gaussian distribution. Hence, it performs significantly better.
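To make this concrete, here is a small self-contained sketch (numpy, with made-up coefficients) that fits a Poisson regression with a log link by plain gradient descent on the negative log-likelihood. In practice you would use a library such as statsmodels or sklearn's `PoissonRegressor`; this is only to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count data: the true mean is exp(0.3 + 0.5x), i.e. non-Gaussian,
# with variance equal to the mean (heteroscedastic by construction)
n = 5000
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])      # add an intercept column
y = rng.poisson(np.exp(0.3 + 0.5 * x))

# Poisson regression (log link), fit by gradient descent on the
# negative log-likelihood: grad = X.T @ (exp(X @ b) - y) / n
b = np.zeros(2)
for _ in range(5000):
    mu = np.exp(X @ b)
    b -= 0.1 * X.T @ (mu - y) / n

print(b)  # should land close to the true coefficients (0.3, 0.5)
```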

If you are interested in learning more about this, I once wrote a detailed guide on this topic here: **Generalized Linear Models (GLMs): The Supercharged Linear Regression.**

It covers:

How does linear regression model data?

The limitations of linear regression.

What are GLMs?

What are the core components of GLMs?

How do they relax the assumptions of linear regression?

What are the common types of GLMs?

What are the assumptions of these GLMs?

How do we use maximum likelihood estimates with these GLMs?

**How to build a custom GLM for your own data?**

Best practices and takeaways.

**👉 Interested folks can read it here: Generalized Linear Models (GLMs): The Supercharged Linear Regression.**

**👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.**

Thanks for reading :)

Every week, I publish 1-2 in-depth deep dives (typically 20+ mins long). Here are some of the latest ones that you will surely like:

**[FREE]** A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.

You Are Probably Building Inconsistent Classification Models Without Even Realizing

Why Sklearn’s Logistic Regression Has no Learning Rate Hyperparameter?

PyTorch Models Are Not Deployment-Friendly! Supercharge Them With TorchScript.

DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.

Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.

You Cannot Build Large Data Projects Until You Learn Data Version Control!

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

👉 If you love reading this newsletter, feel free to share it with friends!

Duplicate records in a dataset create problems because:

It wastes storage space.

It can lead to incorrect data analysis.

It can result in reporting errors and more.

As you are reading this, you might be thinking that we can remove duplicates by using common methods like `df.drop_duplicates()`.

**But what if the data has fuzzy duplicates?**

Fuzzy duplicates are those records that are not exact copies of each other, but somehow, they appear to be the same. This is shown below:

The Pandas method will be ineffective because it will only remove exact duplicates.

So what can we do here?

Let’s imagine that your data has over a **million records**.

**How would you identify fuzzy duplicates in this dataset?**

One way could be to naively compare every pair of records, as depicted below:

We can formulate a distance metric for each field and generate a similarity score for each pair of records.
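A minimal sketch of such a similarity score, using the standard library's `difflib.SequenceMatcher` (the names are the hypothetical examples used below; real pipelines often use per-field metrics like edit distance or Jaro-Winkler):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1] between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Exact duplicates score 1.0; fuzzy duplicates score high but below 1.0,
# while unrelated values score low
print(similarity("Daniel Smith", "Daniel Smith"))   # 1.0
print(similarity("Daniel Smith", "Danial Smith"))   # high (> 0.9)
print(similarity("Daniel Smith", "Philip Jones"))   # low
```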

But this approach is infeasible at scale.

For instance, on a dataset with just a million records, comparing every pair of records will result in ~10^12 comparisons (n², or about half that if each unordered pair is compared only once).

Even if we assume a decent speed of 10,000 comparisons per second, this approach will take ~3 years to complete.
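As a quick sanity check on these numbers, here is a stdlib sketch counting each unordered pair exactly once (about half the n² figure above):

```python
import math

n = 1_000_000
pairs = math.comb(n, 2)                 # each unordered pair compared once
print(f"{pairs:,} comparisons")          # 499,999,500,000 (~5 x 10^11)

rate = 10_000                            # comparisons per second
years = pairs / rate / (3600 * 24 * 365)
print(f"~{years:.1f} years")             # ~1.6 years; roughly double for the full n^2 count
```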

Can we do better?

Of course, we can!

But first, we need to understand a special property of fuzzy duplicates.

If two records are duplicates, they will certainly possess some lexical (or textual) overlap.

For instance, consider the below dataset:

Here, comparing the name “Daniel” to “Philip” or “Shannon” to “Julia” makes no sense. There is literally no lexical overlap.

Thus, they are guaranteed to be distinct records.

This makes intuitive sense as well.

**If we are calling two records as “duplicates,” there must be some lexical overlap.**

Yet, the naive approach will still waste time in comparing them.

We can utilize this “lexical overlap” property of duplicates to cleverly reduce the total comparisons.

More specifically, we segregate the data into smaller buckets by applying some rules.

For instance, consider the above dataset again. One rule could be to create buckets based on the first three letters of the first name.

Thus, we will only compare two records if they are in the same bucket.

If the first three letters are different, the records will fall into different buckets. Thus, they won’t be compared at all.

After segregating the records, we would have **eliminated almost 98-99%** of unnecessary comparisons that would have happened otherwise.

The figure “98-99%” comes from my practical experience of solving this problem on datasets of this size.

Finally, we can use our naive comparison algorithm on each individual bucket.
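The whole blocking idea fits in a few lines of Python. A stdlib sketch, using a hypothetical bucketing rule (first three letters of the name) and the toy names from the example above:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records (names echo the example above)
records = [
    "Daniel Smith", "Danial Smith", "Shannon Lee",
    "Philip Jones", "Shanon Lee", "Julia Brown",
]

# Blocking rule: bucket records by the first three letters of the name
buckets = defaultdict(list)
for r in records:
    buckets[r[:3].lower()].append(r)

# Naive pairwise comparison, but only WITHIN each bucket
duplicates = []
for bucket in buckets.values():
    for a, b in combinations(bucket, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() > 0.9:
            duplicates.append((a, b))

print(duplicates)  # both fuzzy pairs found with only 2 comparisons instead of 15
```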

The optimized approach can run in just a few hours instead of taking years.

This way, we can drastically reduce the run time and still achieve great deduplication accuracy.

Isn’t that cool?

Of course, we would have to analyze the data thoroughly to come up with the above data split rules.

But which is the wiser thing to do:

Using the naive approach, which takes three years to run, OR,

Spending some time analyzing the data, devising rules, and running the deduplication approach in a few hours?

**👉 **Over to you: Can you further optimize this approach?


Thanks for reading!


In most cases, it is **unknown** to us beforehand **why** values are missing.

There could be multiple reasons for missing values. Given that we have already covered this in detail in an earlier issue, here’s a quick recap:

Please read this issue for more details: The First Step Towards Missing Data Imputation Must NEVER be Imputation.

**Missing Completely at Random (MCAR):** The value is genuinely missing by itself and has no relation to that or any other observation.

**Missing at Random (MAR):** Data is missing due to another observed variable. For instance, we may observe that the percentage of missing values differs significantly based on other variables.

**Missing NOT at Random (MNAR):** This one is tricky. MNAR occurs when there is a definite pattern in the missing variable. However, it is unrelated to any feature we can observe in our data. In fact, it may depend on an unobserved feature.


Identifying the reason for missingness can be extremely useful for further analysis, imputation, and modeling.

Today, let’s understand how we can enrich our missing value analysis with heatmaps.

Consider we have a daily sales dataset of a store that has the following information:

Day and Date

Store opening and closing time

Number of customers

Total sales

Account balance at open and close time

We can clearly see some missing values, but the reason is unknown to us.

Here, when doing EDA, many folks compute the **column-wise missing frequency** as follows:

The above table just highlights the number of missing values in each column.

More specifically, we get to know that:

Missing values are relatively high in two columns compared to others.

Missing values in the opening and closing time columns are the same (53).

That’s the only info it provides.

However, the problem with this approach is that it hides many important details about missing values, such as:

Their specific location in the dataset.

Periodicity of missing values (if any).

Missing value correlation across columns, etc.

…which can be extremely useful to understand the reason for missingness.

To put it another way, the above table is more like **summary statistics**, which rarely depict the true picture.

Why?

We have already discussed this a few times before in this newsletter, such as here and here, and below are the visuals from these posts:

So here’s how I often enrich my missing value analysis with heatmaps.

Compare the missing value table we discussed above with the following heatmap of missing values:

The white vertical lines depict the **location of missing values** in a specific column.

Now, it is immediately clear that:

Values are periodically missing in the opening and closing time columns.

Missing values are correlated in the opening and closing time columns.

The missing values in the other columns **appear to be** (though not necessarily) missing completely at random.

Further analysis of the opening time lets us discover that the store always remains closed on Sundays:

Now, we know why the opening and closing times are missing in our dataset.

This information can be beneficial during its imputation.

**This specific situation is “Missing at Random (MAR).”**

Essentially, as we saw above, the missingness is driven by the value of another observed column.
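This kind of check is easy to script. A minimal pandas sketch on a toy frame (hypothetical `date`/`opening_time` columns, with Sundays made missing by construction), computing the missing-rate of one column grouped by another observed column:

```python
import pandas as pd

# Toy sales data: opening_time is missing every Sunday (by construction)
df = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=28, freq="D")})
df["day"] = df["date"].dt.day_name()
df["opening_time"] = "09:00"
df.loc[df["day"] == "Sunday", "opening_time"] = None

# Missing-rate of opening_time, broken down by another observed column (day)
rate = df["opening_time"].isna().groupby(df["day"]).mean()
print(rate)  # 1.0 for Sunday, 0.0 elsewhere: a classic MAR pattern
```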

Now that we know the reason, we can use relevant techniques to impute these values if needed.

For MAR specifically, techniques like kNN imputation, Miss Forest, etc., are quite effective. We covered them in these issues:

Wasn’t that helpful over naive “missing-value-frequency” analysis?

**👉 **Over to you: What are some other ways to improve missing data analysis?


Thanks for reading!


Today, we are continuing with LLMs and learning about LoRA for fine-tuning them: **Implementing LoRA From Scratch for Fine-tuning LLMs**.

How does it work?

Why is it more effective and cost-efficient than traditional fine-tuning?

How to implement it from scratch?

How to use Hugging Face PEFT to fine-tune any model using LoRA.

Read it here: **Implementing LoRA From Scratch for Fine-tuning LLMs**.

In the pre-LLM era, whenever someone open-sourced any high-utility model for public use, in most cases, practitioners would fine-tune that model to their specific task.

Note: Of course, it’s not absolutely necessary for a model to be open-sourced for fine-tuning if the model inventors provide API-based fine-tuning instead and decide to keep the model closed.

Fine-tuning means adjusting the weights of a **pre-trained model** on a new dataset for better performance. This is neatly depicted in the animation below.

While this fine-tuning technique has been successfully used for a long time, problems arise when we use it on much larger models — LLMs, for instance, primarily because of their size.

Consider **GPT-3**, for instance. It has 175B parameters, roughly 510 times bigger than even the largest version of BERT, **BERT-Large**:

I have successfully fine-tuned BERT-large in many of my projects on a single GPU cluster, like in this paper and this paper.

But it would have been impossible to do the same with GPT-3.

Bringing in GPT-4 makes it even more challenging:

Traditional fine-tuning is just not practically feasible here; in fact, not everyone can afford it due to the massive infrastructure it demands.

In fact, it’s not just about the availability of high computing power.

Consider this...

OpenAI trained GPT-3 and GPT-4 models in-house on massive GPU clusters, so they have access to them for sure.

They also provide a fine-tuning API to customize these models on custom datasets.

Going by traditional fine-tuning, for every customer wanting to have a customized version of any of these models, OpenAI would have to dedicate an entire GPU server to load it and also ensure that they maintain sufficient computing capabilities for fine-tuning requests.

To put it into perspective, a GPT-3 model checkpoint is estimated to consume about 350 GB.

And this is the static memory of the model, which only includes model weights. It does not even consider the memory required during training, computing activations, running backpropagation, and more.

To make things worse, what we discussed above is just for one customer, but they already have thousands of customers who have created customized versions of OpenAI models fine-tuned on their datasets.

From this discussion, it must be clear that such scenarios pose a significant challenge for traditional fine-tuning approaches.

The computational resources and time required to fine-tune these large models for individual customers would be immense.

Additionally, maintaining the infrastructure to support fine-tuning requests from potentially thousands of customers simultaneously would be a huge task for them.

**LoRA (and QLoRA) are two superb techniques to address this practical limitation.**

The core idea revolves around smartly **training very few parameters** in comparison to the base model:

The LoRA results show that one can match the performance of full fine-tuning while training less than 1% of the original parameters, which is excellent.
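To see where the "less than 1%" comes from, here is a minimal numpy sketch of a LoRA-style linear layer (hypothetical layer size and rank): the pretrained weight `W` is frozen, and only the two low-rank factors `A` and `B` are trained. As in the LoRA paper, `B` starts at zero, so the adapted layer initially behaves exactly like the base layer:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 1024, 1024, 4            # hypothetical layer size and LoRA rank

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable low-rank factor
B = np.zeros((d_out, r))                  # B starts at zero -> no initial change

def lora_forward(x):
    # Base output plus the low-rank update (B @ A) applied to x
    return W @ x + B @ (A @ x)

trainable = A.size + B.size               # only A and B are trained
print(trainable / W.size)                 # well under 1% of the base parameters
```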

Read the article here for more details and how to implement it from scratch: **Implementing LoRA From Scratch for Fine-tuning LLMs.**

Thanks for reading!

The data type assigned by default is usually 64-bit or 32-bit, when there is also scope for 16-bit, for instance. This is also evident from the code below:

As a result, we are not entirely optimal at efficiently allocating memory.

Of course, this is done to ensure better precision in representing information.

However, this precision always comes at the cost of additional memory utilization, which may not be desired in all situations.

In fact, it is also observed that many tensor operations, **especially matrix multiplication**, are much faster when we operate under smaller precision data types than larger ones, as demonstrated below:

Moreover, since `float16` is only half the size of `float32`, its usage reduces the memory required to train the network.

This also allows us to train larger models, train on larger mini-batches (resulting in even more speedup), etc.

**Mixed precision training** is a pretty reliable and **widely adopted** technique in the industry to achieve this.

As the name suggests, the idea is to employ the lower-precision `float16` (wherever feasible, like in convolutions and matrix multiplications) along with `float32`; hence the name “mixed precision.”

This is a list of some models I found that were trained using mixed precision:

It’s pretty clear that mixed precision training is widely used, yet we don’t often hear about it.

Before we get into the technical details…

From the above discussion, it must be clear that as we use a low-precision data type (`float16`), we might unknowingly introduce some numerical inconsistencies and inaccuracies.

To avoid them, there are some best practices for mixed precision training that I want to talk about next, along with the code.

Leveraging mixed precision training in PyTorch requires a few modifications in the existing network training implementation.

Consider this is our current PyTorch model training implementation:

The first thing we introduce here is a `scaler` object that will scale the loss value:

We do this because, at times, the original loss value can be so low that we might not be able to compute gradients in `float16` with full precision.

Such situations may not produce any update to the model’s weights.

Scaling the loss to a higher numerical range ensures that even small gradients can contribute to the weight updates.
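The underflow problem is easy to demonstrate. A small numpy sketch (hypothetical gradient value and scale factor) showing why loss scaling rescues tiny gradients:

```python
import numpy as np

grad = 1e-8                                # a tiny gradient value

# Stored directly in float16, it underflows to zero: the update is lost
print(np.float16(grad))                    # 0.0

# Scaling the loss (and hence the gradients) by, say, 2**16 first keeps the
# value representable in float16; we unscale before the weight update
scale = 2.0 ** 16
scaled = np.float16(grad * scale)
print(scaled)                              # non-zero
print(float(scaled) / scale)               # ~1e-8 again after unscaling
```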

But these minute gradients can only be accommodated into the weight matrix when the weight matrix itself is represented in high precision, i.e., `float32`.

**Thus, as a conservative measure, we tend to keep the weights in `float32`.**

That said, the loss scaling step is not strictly necessary because, in my experience, these little updates **typically** appear towards the end stages of model training.

Thus, it can be fair to assume that small updates may not drastically impact the model performance.

But don’t take this as a definite conclusion; it’s something I want you to validate when you use mixed precision training.

Moving on, as the weights (which are matrices) are represented in `float32`, we cannot expect the speedup of `float16` if they remain this way:

To leverage these `float16`-based speedups, here are the steps we follow:

We make a `float16` copy of the weights during the forward pass.

Next, we compute the loss value in `float32` and scale it so that the gradients, which are computed in `float16`, have more precision. The reason we compute gradients in `float16` is that, like the forward pass, gradient computations also involve matrix multiplications. Thus, keeping them in `float16` can provide additional speedup.

Once we have computed the gradients in `float16`, the heavy matrix multiplication operations have been completed. Now, all we need to do is update the original weight matrix, which is in `float32`.

Thus, we make a `float32` copy of the above gradients, remove the scale we applied in Step 2, and update the `float32` weights.

Done!

The mixed-precision settings in the forward pass are carried out by the `torch.autocast()` context manager:

Now, it’s time to handle the backward pass.

Line 13 → `scaler.scale(loss).backward()`: the `scaler` object scales the loss value, and `backward()` is called to compute the gradients.

Line 14 → `scaler.step(opt)`: unscales the gradients and updates the weights.

Line 15 → `scaler.update()`: updates the scale for the next iteration.

Line 16 → `opt.zero_grad()`: zeroes the gradients.
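Put together, a minimal runnable sketch of such a loop looks like this (a toy linear model on random data, purely illustrative; autocast and scaling are enabled only when a GPU is present, since `GradScaler` targets CUDA):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_cuda = device == "cuda"

# Toy model, optimizer, and data (hypothetical shapes, for illustration)
model = torch.nn.Linear(16, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(32, 16, device=device)
y = torch.randn(32, 1, device=device)

for _ in range(3):
    # Forward pass under autocast: eligible ops run in float16 on GPU
    with torch.autocast(device_type=device, enabled=use_cuda):
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()   # scale the loss, then compute gradients
    scaler.step(opt)                # unscale gradients and update weights
    scaler.update()                 # adjust the scale for the next iteration
    opt.zero_grad()                 # zero the gradients
```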

Done!

The efficacy of mixed precision scaling over traditional training is evident from the image below:

Mixed precision training is over 2.5x faster than conventional training.

Isn’t that cool?

Refer to this PyTorch documentation page for more code-related details: PyTorch Automatic Mixed Precision Training.

Another pretty useful way to speed up model training is using Momentum. We covered it recently in this newsletter issue: An Intuitive and Visual Demonstration of Momentum in Machine Learning.

**👉 **Over to you: What are some other reliable ways to speed up machine learning model training?


Thanks for reading!

If you’re not a full subscriber, here’s what you missed last month:

A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.

You Are Probably Building Inconsistent Classification Models Without Even Realizing

Why Sklearn’s Logistic Regression Has no Learning Rate Hyperparameter?

PyTorch Models Are Not Deployment-Friendly! Supercharge Them With TorchScript.

How To (Immensely) Optimize Your Machine Learning Development and Operations with MLflow.

DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.

Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.

You Cannot Build Large Data Projects Until You Learn Data Version Control!

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

**👉 Tell the world what makes this newsletter special for you by leaving a review here :)**

👉 If you love reading this newsletter, feel free to share it with friends!

In other words, the query data point must be matched across **all** data points to find the nearest neighbor(s).

This is highly inefficient, especially when we have many data points and a near-real-time response is necessary.

That is why **approximate nearest neighbor search algorithms** are becoming increasingly popular.

The core idea is to narrow down the search space using indexing techniques, thereby improving the overall run-time performance.

**Inverted File Index (IVF)** is possibly one of the simplest and most intuitive techniques here, which you can immediately start using.

Here’s how it works:

Given a set of data points in a high-dimensional space, the idea is to organize them into different partitions, typically using clustering algorithms like k-means.

As a result, each partition has a corresponding centroid, and every data point gets associated with **only one** partition corresponding to its nearest centroid.

Also, every centroid maintains information about all the data points that belong to its partition.

Indexing done!

Here’s how we search.

When searching for the nearest neighbor(s) to the query data point, instead of searching across the entire dataset, we first find the closest **centroid** to the query:

Once we find the nearest centroid, the nearest neighbor is searched in only those data points that belong to the closest partition found:
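The two phases above fit in a short pure-Python sketch. For simplicity, this uses a fixed grid of hypothetical centroids rather than running k-means, but the index/search logic is the same:

```python
import random
from math import dist

random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(10_000)]

# --- Indexing: assign every point to its nearest centroid's partition ---
# (centroids would normally come from k-means; a fixed grid is used here)
centroids = [(cx, cy) for cx in (25, 75) for cy in (25, 75)]
partitions = {i: [] for i in range(len(centroids))}
for p in points:
    i = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
    partitions[i].append(p)

# --- Search: find the nearest centroid, then scan ONLY its partition ---
def ivf_search(query):
    i = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    return min(partitions[i], key=lambda p: dist(query, p))

print(ivf_search((30.0, 40.0)))
```

Each query now scans roughly a quarter of the points instead of all 10,000, at the cost of possibly missing a neighbor that sits just across a partition boundary.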

Let’s see how the run-time complexity stands in comparison to traditional kNN.

Consider the following:

There are N data points

Each data point is D dimensional

We create K partitions.

Lastly, for simplicity, let’s assume that each partition gets equal data points.

In kNN, the query data point is matched to all N data points, which makes the time complexity → O(ND).

In IVF, however, there are two steps:

Match to all centroids → **O(KD)**.

Find the nearest neighbor in the nearest partition → **O(ND/K)**.

The final time complexity comes out to be the following:

…which is significantly lower than that of kNN.

To get some perspective, assume we have 10M data points. The search complexity of kNN will be proportional to **10M**.

But with IVF, say we divide the data into 100 centroids, and each partition gets roughly 100k data points.

Thus, the time complexity comes out to be proportional to 100 + 100k = 100100, which is nearly **100 times faster**.

Of course, it is essential to note that if some data points are actually close to the input data point but still happen to be in the neighboring partition, we will miss them during the nearest neighbor search, as shown below:

But this accuracy tradeoff is something we willingly accept for better run-time performance, which is precisely why these techniques are called “**approximate** nearest neighbors search.”

In one of the recent deep dives on vector databases (not paywalled), we discussed 4 such techniques, along with an entirely beginner-friendly and thorough discussion on **Vector Databases**.

Check it out here if you haven’t already: **A Beginner-friendly and Comprehensive Deep Dive on Vector Databases**.

Moreover, around 6 weeks back, we discussed two powerful ways to supercharge kNN models in this newsletter. Read it here: **Two Simple Yet Immensely Powerful Techniques to Supercharge kNN Models**.


Thanks for reading!


In other words, there is a high percentage of neurons that, if removed from the trained network, will not affect its performance remarkably:

And, of course, I am not saying this as a random and uninformed thought.

I have experimentally verified this over and over across my projects.

Here’s the core idea.

After training is complete, we run the dataset through the model (no backpropagation this time) and analyze the **average activation** of individual neurons.

Here, we often observe that many **neuron activations** are always close to near-zero values.

Thus, they can be pruned from the network, as they will have very little impact on the model’s output.

For pruning, we can decide on a pruning threshold (λ) and prune all neurons whose activations are less than this threshold.
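A minimal numpy sketch of this idea, on synthetic activations (hypothetical layer size and threshold; in a real model, the activations would come from a forward pass over the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations of 256 hidden neurons over 10,000 examples
activations = rng.uniform(0, 1, size=(10_000, 256))
activations[:, ::2] *= 0.01          # half the neurons are nearly always ~0

# Average (absolute) activation of each neuron over the whole dataset
avg = np.abs(activations).mean(axis=0)

lam = 0.4                            # pruning threshold (lambda)
keep = avg >= lam                    # neurons to keep; the rest are pruned

print(keep.sum(), "of", keep.size, "neurons kept")
```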

This makes intuitive sense as well.

More specifically, if a neuron **rarely** possesses a high activation value, then it is fair to assume that it isn’t contributing to the model’s output, and we can safely prune it.

The following table compares the accuracy of the pruned model with the original (full) model across a range of pruning thresholds (λ):

Notice something here.

At a pruning threshold **λ=0.4**, the validation accuracy of the model drops by just 0.62%, but **the number of parameters drops by 72%**.

That is a huge reduction, with both models being almost equally good!

Of course, there is a trade-off because we are not doing as well as the original model.

But in many cases, especially when deploying ML models, accuracy is not the only metric that drives these decisions.

Instead, several operational metrics like **efficiency**, **speed**, **memory consumption**, etc., are also a key deciding factor.

That is why model compression techniques are so crucial in such cases.

If you want to learn more, we discussed them in this deep dive: **Model Compression: A Critical Step Towards Efficient Machine Learning**.

While we only discussed one such technique today (activation pruning), the article discusses 6 model compression techniques, with PyTorch implementation.

**👉 **Over to you: What are some other ways to make ML models more production-friendly?


Thanks for reading!


And, of course, there are various ways to speed up model training, like:

Batch processing

Leverage distributed training using frameworks like PySpark MLLib.

Use better Hyperparameter Optimization, like Bayesian Optimization, which we discussed here: Bayesian Optimization for Hyperparameter Tuning.

and many other techniques.

**Momentum** is another reliable and effective technique to speed up model training.

While Momentum is pretty popular, many people struggle to intuitively understand how it works and why it is effective.

Let’s understand this today!

In gradient descent, every parameter update solely depends on the current gradient.

This is clear from the gradient weight update rule shown below:

As a result, we end up having many unwanted oscillations during the optimization process.

Let’s understand this more visually.

Imagine this is the loss function contour plot, and the optimal location (parameter configuration where the loss function is minimum) is marked here:

Simply put, this plot illustrates how gradient descent moves towards the optimal solution. At each iteration, the algorithm calculates the gradient of the loss function at the current parameter values and updates the weights.

This is depicted below:

Notice two things here:

It unnecessarily oscillates vertically.

It ends up at the non-optimal solution after some epochs.

Ideally, we would have expected our weight updates to look like this:

It must have taken longer steps in the horizontal direction…

…and smaller vertical steps because a movement in this direction is unnecessary.

This idea is also depicted below:

Momentum-based optimization slightly modifies the update rule of gradient descent.

More specifically, it also considers a **moving average** of past gradients:

This helps us handle the unnecessary vertical oscillations we saw earlier.

How?

Since Momentum considers a **moving average of past gradients**, if the recent gradient update trajectory looks as shown in the following image, then its average in the vertical direction will be very low while that in the horizontal direction will be large (which is precisely what we want):

As this moving average gets **added** to the gradient updates, it helps the optimization algorithm take larger steps in the desired direction.
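In one common formulation, plain gradient descent updates w ← w − η·∇L(w), while Momentum maintains v ← β·v + ∇L(w) and updates w ← w − η·v. Here is a toy sketch of that idea (the loss function, starting point, and hyperparameter values are all illustrative) on an elongated quadratic bowl, the kind of loss surface where plain gradient descent oscillates along the steep direction:

```python
# Toy sketch of momentum-based gradient descent (illustrative values) on an
# elongated quadratic bowl f(w) = 0.5*(w1^2 + 25*w2^2), whose minimum is (0, 0).

def grad(w):
    # gradient of f: steep along w2, flat along w1
    return [w[0], 25.0 * w[1]]

def momentum_descent(lr=0.02, beta=0.9, steps=300):
    w = [10.0, 1.0]   # start far along the flat direction
    v = [0.0, 0.0]    # moving average of past gradients
    for _ in range(steps):
        g = grad(w)
        v = [beta * v[i] + g[i] for i in range(2)]  # accumulate past gradients
        w = [w[i] - lr * v[i] for i in range(2)]    # step along the average
    return w

print(momentum_descent())  # both coordinates end up near the minimum (0, 0)
```

Because the vertical gradient components keep flipping sign, they cancel inside `v`, while the consistent horizontal components add up — which is exactly the behavior described above.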

This way, we can:

Smoothen the optimization trajectory.

Reduce unnecessary oscillations in parameter updates, which also speeds up training.

This is also evident from the image below:

This time, the gradient update trajectory shows much smaller oscillations in the vertical direction, and it also manages to reach an optimum under the same number of epochs as earlier.

This is the core idea behind Momentum and how it works.

Of course, Momentum does introduce another hyperparameter (Momentum rate) in the model, which should be tuned appropriately like any other hyperparameter:

For instance, considering the 2D contours we discussed above:

Setting an extremely large value of Momentum rate will significantly expedite gradient update in the horizontal direction. This may lead to overshooting the minima, as depicted below:

Conversely, setting an extremely small Momentum rate provides hardly any acceleration, defeating the whole purpose of Momentum.

If you want to have a more hands-on experience, check out this tool: **Momentum Tool**.

**👉 **Over to you: What are some other reliable ways to speed up machine learning model training?

**The button is located towards the bottom of this email.**

Thanks for reading!

If you’re not a full subscriber, here’s what you missed last month:

A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.

You Are Probably Building Inconsistent Classification Models Without Even Realizing

Why Sklearn’s Logistic Regression Has no Learning Rate Hyperparameter?

PyTorch Models Are Not Deployment-Friendly! Supercharge Them With TorchScript.

How To (Immensely) Optimize Your Machine Learning Development and Operations with MLflow.

DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.

Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.

You Cannot Build Large Data Projects Until You Learn Data Version Control!

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

**👉 Tell the world what makes this newsletter special for you by leaving a review here :)**

👉 If you love reading this newsletter, feel free to share it with friends!

To reiterate, we saw how it allows us to validate and control the attribute (as achieved by defining explicit setter and getter methods) using dot notation itself.

Today, I want to continue our discussion on yesterday’s topic and tell you a limitation of the above approach, which I did not cover yesterday.

Moving on, we shall see how **Descriptors** in Python provide a much more elegant way of setting and getting values.

Let’s begin!

Consider the above class implementation again:

The biggest issue here is that **we must define a getter and setter for every instance-level attribute**.

So what if our class has, say, 3 such attributes, and all must be positive?

Of course, we will have **3 getters** and **3 setters**, which makes the overall implementation long, messy, and redundant.

There’s redundancy because every setter method will have almost the same lines of code (the if statements for validation).

Also, if you think about it, the getter methods are somewhat redundant and unnecessary too, as they just return an attribute.

If that is clear, there’s one more issue with the above implementation.

Recall what I mentioned earlier: “Our class will have 3 such instance-level attributes, and **all must be positive**?”

See what happens when we create an object with an invalid input:

As depicted above, Python does not raise any error when, ideally, it should.

One common way programmers try to eliminate redundancy is by defining explicit validation functions.

For instance, we can define a function that just validates the value received, as demonstrated below:

Next, we can invoke this method wherever needed:

But this does not solve the problem either:

We still have explicit and redundant function calls in each setter method.

All getter methods still do the same thing and have high redundancy.

And most importantly, the `__init__` method is now cluttered with multiple function calls.

Simply put, `Descriptors` are objects with methods (like `__get__`, `__set__`, etc.) that are used to manage access to the attributes of **another class**.

So, every descriptor object is assigned to only one attribute of **another class**.

And just to be clear, this “another class” is the class we are primarily interested in — the `DummyClass` we saw earlier, for instance.

Thus:

The attribute `number1` → gets its own descriptor.

The attribute `number2` → gets its own descriptor.

The attribute `number3` → gets its own descriptor.

A typical `Descriptor` class is implemented with three methods, as shown below:

The `__set__` method is called when the attribute is assigned a new value. We can define the custom checks here.

The `__set_name__` method is called when the descriptor object is assigned to a class attribute. It allows the descriptor to keep track of the name of the attribute it’s assigned to within the class.

The `__get__` method is called when the attribute is accessed.

Also:

The `instance` parameter refers to the object of the desired class — `DummyClass()`.

The `owner` parameter is the desired class itself — `DummyClass`.

The `value` parameter is the value being assigned to an attribute of the desired class.

The `name` parameter is the name of the attribute.

If it’s unclear, let me give you a simple demonstration.

Consider this `Descriptor` class:

I’ll explain this implementation shortly, but before that, let’s consider its usage, which is demonstrated below:

Now, let’s go back to the `DescriptorClass` implementation:

`__set_name__(self, owner, name)`: This method is called when the descriptor is assigned to a class attribute (line 3). It saves the name of the attribute in the descriptor for later use.

`__set__(self, instance, value)`: When a value is assigned to the attribute (line 6), this method is called. It raises an error if the value is negative. Otherwise, it stores the value in the instance’s dictionary under the attribute name we saved earlier.

`__get__(self, instance, owner)`: When the attribute is accessed, this method is called. It returns the value from the instance’s dictionary.
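Since the class implementations above are shown as images, here is a hedged reconstruction of the descriptor pattern being described (the names `PositiveNumber` and `DummyClass` are illustrative):

```python
# A descriptor that validates an attribute of another class.

class PositiveNumber:
    def __set_name__(self, owner, name):
        self.name = name                      # remember the managed attribute's name

    def __set__(self, instance, value):
        if value < 0:                         # validation runs on every assignment
            raise ValueError(f"{self.name} must be positive")
        instance.__dict__[self.name] = value  # store under the saved name

    def __get__(self, instance, owner):
        return instance.__dict__[self.name]   # plain lookup, no extra logic


class DummyClass:
    number = PositiveNumber()                 # one descriptor per managed attribute

    def __init__(self, number):
        self.number = number                  # triggers PositiveNumber.__set__


obj = DummyClass(5)
print(obj.number)                             # prints 5

try:
    DummyClass(-1)                            # validation fires inside __init__ too
except ValueError as err:
    print(err)                                # prints "number must be positive"
```

Note that the validation lives in exactly one place, no matter how many attributes reuse `PositiveNumber`.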

Done!

Now, see how this solution smartly solves all the problems we discussed earlier.

Let’s create an object of the `DummyClass`:

As depicted above, assigning an invalid value to the attribute raises an error.

Next, let’s see what happens when the attribute specified during the initialization is invalid:

Great! It validates the initialization too.

Here, recall that we never defined any explicit checks in the `__init__` method, which is super cool.

Moving on, let’s define multiple attributes in the `DummyClass` now:

Creating an object and setting an invalid value for any of the attributes raises an error:

Works seamlessly!

Recall that we never defined multiple getters and setters for each attribute individually, like we did with the `@property` decorator earlier.

This is great, isn’t it?

I find descriptors to be massively helpful in reducing work and code redundancy while also making the entire implementation much more elegant.

If you want to try them out, I prepared this notebook for you to get started: **Python Descriptors Notebook**.

Have fun playing around with them!

Also, here’s a full deep dive into Python OOP if you want to learn more about advanced OOP in Python: Object-Oriented Programming with Python for Data Scientists.

👉 Over to you: What are some cool things you know about Python OOP?


**Yet, with dot notation, we cannot validate the updates made to an attribute.**

This means we can assign invalid values to an instance’s attributes, as shown below:

One common way to avoid this is by defining a setter (`set_side()`), which validates the assignment step.

But explicitly invoking a setter method isn’t as elegant as dot notation, is it?

Ideally, we would want to:

Use dot notation

and still apply those validation checks

The `@property` decorator in Python can help.

Here’s how we can use it here.

First, define a getter as follows:

Declare a method with the attribute’s name.

There’s no need to specify any parameters for this method.

Decorate it with the `@property` decorator.

Next, define a setter as follows:

Declare a method with the attribute’s name.

Specify the parameter you want to update the attribute with.

Write the conditions as you usually would in any other setter method.

Decorate it with the `@<attribute-name>.setter` decorator.

Done!

Now, you can use the dot notation while still having validation checks in place.

This is demonstrated below:
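As a minimal sketch of this pattern (assuming a `Square` class whose `side` attribute must stay positive — the names are illustrative):

```python
# Validated attribute access via @property: dot notation plus checks.

class Square:
    def __init__(self, side):
        self.side = side        # routed through the setter below

    @property
    def side(self):
        return self._side       # getter: simply return the stored value

    @side.setter
    def side(self, value):
        if value <= 0:          # setter: validate before storing
            raise ValueError("side must be positive")
        self._side = value


sq = Square(4)
sq.side = 10                    # plain dot notation, yet validated
print(sq.side)                  # prints 10

try:
    sq.side = -2                # invalid update is rejected
except ValueError as err:
    print(err)
```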

This approach offers both:

The validation and control of explicit setters and getters.

The elegance of dot notation.

Isn’t that cool?

Here are some interesting reads around Python OOP, which we have covered in the past and which you should read next:

Python Does Not Fully Deliver OOP Encapsulation Functionalities

The Most Common Misconception About __init__() Method in Python

One of the Most Critical Pillars of OOP is Missing from Python

Also, here’s a full deep dive into Python OOP if you want to learn more about advanced OOP in Python: Object-Oriented Programming with Python for Data Scientists.

👉 What are some cool things you know about Python OOP?


For instance, consider fitting a polynomial regression model on the dummy dataset below:

In case you don’t know, this is called a polynomial regression model:

It is expected that as we increase the degree (`m`) and train the polynomial regression model:

The training loss will get closer and closer to zero.

The test (or validation) loss will first reduce and then get bigger and bigger.

This is because, with a higher degree, the model will find it easier to contort its regression fit through each training data point, which makes sense.
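The first half of this behavior can be checked concretely with a small numpy sketch (the noisy sine data and the chosen degrees are assumptions made purely for illustration): training error keeps shrinking as the degree grows.

```python
# Training MSE of a least-squares polynomial fit as the degree m grows.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

train_mse = {}
for m in (1, 3, 5, 9):
    coeffs = np.polyfit(x, y, deg=m)                      # degree-m least-squares fit
    train_mse[m] = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(m, round(train_mse[m], 4))                      # MSE shrinks with m
```

Because a degree-`m` polynomial family contains every lower-degree family, the least-squares training error can only go down (or stay flat) as `m` increases.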

In fact, this is also evident from the following loss plot:

**But notice what happens when we continue to increase the degree (`m`):**

That’s strange, right?

**Why does the test loss increase to a certain point but then decrease?**

This was not expected, was it?

Well…what you are seeing is called the “**double descent phenomenon**,” which is quite commonly observed in many ML models, especially deep learning models.

It shows that, counterintuitively, increasing the model complexity beyond the point of interpolation can improve generalization performance.

**In fact, this whole idea is deeply rooted in why LLMs, although massively big (billions or even trillions of parameters), can still generalize pretty well.**

And it’s hard to accept it because this phenomenon directly challenges the traditional bias-variance trade-off we learn in any introductory ML class:

Putting it another way, training very large models, **even with more parameters than training data points**, can still generalize well.

To the best of my knowledge, this is still an **open question**, and it isn’t entirely clear why neural networks exhibit this behavior.

There are some theories around regularization, however, such as this one:

It could be that the model applies some sort of **implicit regularization**, with which it can focus on an apt number of parameters for generalization.

But to be honest, nothing is clear yet.

👉 Over to you: I would love to hear from you today on what you think about this phenomenon and its possible causes.


Anyone who wants to learn its methods is quite likely to be intimidated by its API reference topics:

If you are in a similar situation, I once prepared this NumPy cheat sheet, which depicts the 40 most commonly used methods from NumPy:

Having used NumPy for over 4.5 years, I can confidently say that you will use these methods 95% of the time working with NumPy.

It is important to understand that whenever you are learning a new library, mastering/practicing each and every method is not necessary.

Instead, put Pareto’s principle to work :)

20% of your inputs contribute towards generating 80% of your outputs.

👉 Over to you: Have I missed any commonly used method?


In the responses, sooo many of you showed interest in learning about LLMs and vector databases.

Considering this, I have published a pretty extensive (~50 min read) and entirely beginner-friendly deep dive on this topic: **A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.**

Also, as so many of you wanted to learn about this topic, **I have decided NOT to paywall this article, and keep it open to all viewers.**

I am pretty confident that if you have no idea what vector databases are, how they work, why they are so powerful, etc., this deep dive will clear everything for you, with proper intuition.

We also do a practical demo of vector databases using Pinecone so that you get to have hands-on experience.

Please read it here: **A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.**

Before I end…

I urge you to fill out the survey below if you haven’t already.

It really helps me learn more about who you are, what you do, what you look forward to in this newsletter, and how I can improve it:

I might never have published the above deep dive and instead stuck to more data science topics, had I not known that so many of you are interested in this topic.

Have a good day!

Avi

Yet, they can be highly misleading at times.

Let’s understand how!

To begin, a box plot is a graphical representation of just five numbers:

min

first quartile

median

third quartile

max

This means that if two entirely different distributions share these five values, they will produce identical box plots.

This is evident from the image below:

As depicted above, three datasets have the same box plots, but entirely different distributions.
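The same effect can be reproduced with the standard library alone. The two samples below are assumed toy data, engineered so that a bimodal sample and a more evenly spread sample share an identical five-number summary:

```python
# Two very different distributions with the same box plot.
import statistics

def five_number_summary(data):
    # statistics.quantiles with n=4 returns [Q1, median, Q3]
    q1, median, q3 = statistics.quantiles(data, n=4)
    return (min(data), q1, median, q3, max(data))

bimodal = [0, 1, 1, 1, 5, 9, 9, 9, 10]          # clustered at both ends
spread  = [0, 0.5, 1.5, 3, 5, 7, 8.5, 9.5, 10]  # roughly even coverage

print(five_number_summary(bimodal))
print(five_number_summary(spread))              # same summary, different shape
```

A box plot of either sample would look identical, even though the underlying shapes are nothing alike.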

This shows that solely looking at a box plot may lead to incorrect or misleading conclusions.

Here, the takeaway is not that box plots should not be used.

Instead, it’s similar to what we saw in one of the earlier posts about correlation: **“Whenever we generate any summary statistic, we lose essential information.”**

Correlation is also a summary statistic, and as shown below, adding just two outliers changes the direction of correlation:

Thus, it is always important to look at the underlying data distribution.

For instance, whenever I create a box plot, I create a violin (or KDE) plot too. This lets me validate whether summary statistics resonate with the data distribution.

In fact, I also find **Raincloud plots** to be pretty useful.

They provide a pretty concise way to combine and visualize three different types of plots together.

These include:

Box plots for data statistics.

Strip plots for data overview.

KDE plots for the probability distribution of data.

👉 Over to you: What other measures do you take when using summary statistics?


To recall, we discussed the motivation behind using KernelPCA over PCA for dimensionality reduction.

More specifically, we understood that even though our data is non-linear, PCA still produces a linear subspace for projection, but KernelPCA handles it well:

Today, I want to continue our discussion on PCA and highlight one common misconception about PCA.

Let’s begin!

PCA, by its very nature, is a **dimensionality reduction** technique.

Yet, at times, many use PCA for visualizing high-dimensional datasets. This is done by projecting the given data into two dimensions and visualizing it.

While this may appear like a fair thing to do, there’s a big problem here that often gets overlooked.

To understand this problem, we first need to understand a bit about how PCA works.

The core idea in PCA is to linearly project the data to another space using the **eigenvectors** of the covariance matrix.

Why eigenvectors?

It creates uncorrelated features, which is useful because the features with the least variance can then be dropped for dimensionality reduction.

It ensures that new features collectively preserve the original data variance.

If you wish to learn the mathematical origin of eigenvectors in PCA, we covered it here: PCA article.

Coming back to the visualization topic…

As discussed above, after applying PCA, each new feature captures a fraction of the original data variance.

Thus, if we intend to use PCA for visualization by projecting the data to two dimensions…

…then this visualization will only be useful if the **first two principal components** collectively capture most of the original data variance.

If they don’t, then the two-dimensional visualization will be highly misleading and incorrect.

We can avoid this mistake by plotting a **cumulative explained variance (CEV) plot**.

As the name suggests, it plots the cumulative variance explained by principal components.

In sklearn, for instance, the explained variance fraction is available in the `explained_variance_ratio_` attribute:

We can create a cumulative plot of explained variance and check whether the first two components explain the majority of variance.
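The check can be sketched with numpy alone (the toy data and per-feature scales below are assumptions for illustration); with sklearn, the same fractions come from the `explained_variance_ratio_` attribute of a fitted `PCA` object:

```python
# Cumulative explained variance from the covariance-matrix eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
# toy data: five independent features with very different variances
X = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2])

# eigenvalues of the covariance matrix = variances along the principal components
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
cev = np.cumsum(eigvals) / eigvals.sum()

print(np.round(cev, 3))  # decide: do the first two entries explain enough?
```

If `cev[1]` (the fraction captured by the first two components) is high, a 2D PCA plot is trustworthy; if not, most of the structure lives in the discarded dimensions.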

For instance, in the plot below, the first two components only explain 55% of the original data variance.

Thus, visualizing this dataset in 2D using PCA may not be a good choice because plenty of data variance is missing.

However, in the below plot, the first two components explain **90%** of the original data variance.

Thus, using PCA for visualization looks like a fair thing to do.

As a takeaway, use PCA for 2D visualization only when the above plot suggests so.

If it does not, then refrain from using PCA for 2D visualization and use other techniques specifically meant for visualization, like t-SNE, UMAP, etc.

If you want to learn the entirety of t-SNE, we covered it here: **Formulating and Implementing the t-SNE Algorithm From Scratch**.

👉 Over to you: What are some other problems with using PCA for visualization?

I completed 500 days of writing this daily newsletter yesterday. To celebrate this, I am offering a limited-time 50% discount on full memberships.

**The offer ends in the next 10 hours.**

Also, I will publish an entirely beginner-friendly and extensive deep dive into vector databases tomorrow, which I wouldn’t recommend missing out on.

**Join here** or click the button below to join today:

Thanks!


Today is a special day as this newsletter has completed **500 days** of serving its readers.

It started on 3rd Oct 2022, and it’s unbelievable that we have come so far. Thanks so much for your consistent readership and support.

Today, I am offering a limited-time discount of 50% off on full memberships.

If you have ever wanted to join, this will be the perfect time, as **this discount will end in the next 36 hours**.

**Join here** or click the button below to join today:

Thanks, and let’s get to today’s post now!

During dimensionality reduction, principal component analysis (PCA) tries to find a low-dimensional **linear** subspace that the given data conforms to.

For instance, consider the following dummy dataset:

It’s pretty clear from the above visual that there is a linear subspace along which the data could be represented while retaining maximum data variance. This is shown below:

But what if our data conforms to a low-dimensional yet **non-linear** subspace?

For instance, consider the following dataset:

Do you see a low-dimensional **non-linear** subspace along which our data could be represented?

No?

Don’t worry. Let me show you!

The above curve is a continuous non-linear and low-dimensional subspace along which we could represent our given data.

Okay…so why don’t we do it then?

**The problem is that PCA cannot determine this subspace because the data points are not aligned along a straight line.**

In other words, PCA is a linear dimensionality reduction technique.

Thus, it falls short in such situations.

Nonetheless, if we consider the above non-linear data, don’t you think there’s still some intuition telling us that this dataset can be reduced to one dimension **if we can capture this non-linear curve**?

**KernelPCA (or the kernel trick) precisely addresses this limitation of PCA.**

The idea is pretty simple:

Project the data to another high-dimensional space using a **kernel function**, where the data becomes linearly representable. Sklearn provides a KernelPCA wrapper, supporting many popularly used kernel functions.

Apply the standard PCA algorithm to the transformed data.
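The two steps can be sketched compactly in numpy (an RBF kernel and toy circular data are assumed here; sklearn’s `KernelPCA` wraps the same idea behind a fitted estimator):

```python
# Minimal RBF kernel PCA: kernel matrix, centering, eigendecomposition.
import numpy as np

def rbf_kernel_pca(X, gamma=1.0, n_components=2):
    # Step 1: implicit high-dimensional projection via the RBF kernel matrix
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)

    # Step 2: standard PCA in that space — center K, then eigendecompose
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))

# e.g., points on a circle — a non-linear 1D structure plain PCA cannot flatten
theta = np.linspace(0, 2 * np.pi, 50)
X = np.c_[np.cos(theta), np.sin(theta)]
print(rbf_kernel_pca(X, gamma=2.0).shape)  # prints (50, 2)
```

Note the kernel matrix is `n × n` — the source of the quadratic cost in the number of data points discussed below.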

The efficacy of KernelPCA over PCA is evident from the demo below.

As shown below, even though the data is non-linear, PCA still produces a linear subspace for projection:

However, KernelPCA produces a non-linear subspace:

Isn’t that cool?

The catch is the run time.

Please note that the run time of PCA is already cubically related to the number of dimensions.

KernelPCA involves the kernel trick, which is quadratically related to the number of data points (`n`).

Thus, it increases the overall run time.

This is something to be aware of when using KernelPCA.

**👉 **Over to you: What are some other limitations of PCA?


While the motivation for Pandas and SQL is clear and well-known, let me tell you why you should care about Polars and PySpark.

Pandas has many limitations, which Polars addresses, such as:

Pandas always adheres to single-core computation → Polars is multi-core.

Pandas offers no lazy execution → Polars does.

Pandas creates bulky DataFrames → Polars’ DFs are lightweight.

Pandas is slow on large datasets → Polars is remarkably efficient.

In fact, if we look at the run-time comparison on some common operations, it’s clear that Polars is much more efficient than Pandas:

While tabular data space is mainly dominated by Pandas and Sklearn, one can hardly expect any benefit from them beyond some GBs of data due to their single-node processing.

A more practical solution is to use distributed computing instead — a framework that disperses the data across many small computers.

Spark is among the best technologies used to quickly and efficiently analyze, process, and train models on big datasets.

That is why most data science roles at big tech demand proficiency in Spark. It’s that important.

We covered this in detail in a recent deep dive as well: Don’t Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark.

👉 Over to you: What are some other faster alternatives to Pandas that you are aware of?


CPython serves as the standard interpreter for Python and offers no built-in optimization.

This profoundly affects the run-time performance of the program, especially when it’s all native Python code.

Today, I want to tell you about **Cython**, an optimizing compiler that addresses the limitations of Python’s default interpreter.

CPython and Cython are different; don’t confuse the two.

Let’s begin!

In a nutshell, **Cython** automatically converts your Python code into C, which is fast and efficient.

Here’s how we use it, say, in a Jupyter Notebook:

First, load the Cython extension (in a separate cell of the notebook):

Next, add the Cython magic command at the top of any cell that contains native Python code:

If our code has functions, specify the data type of the parameters as follows:

Define every variable using the `cdef` keyword and specify its data type.

A sample conversion of a Python function is shown below:
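To make this concrete, here is a small, made-up example: a pure-Python function and, in comments, what its Cython-typed counterpart could look like. The typed version must be compiled (e.g., in a notebook cell starting with `%%cython`), so it is shown as comments rather than runnable Python:

```python
# Native Python: every variable is dynamically typed,
# so the interpreter re-checks types on every operation.
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

# A possible Cython counterpart (inside a %%cython cell):
#
#   def sum_of_squares_cy(int n):
#       cdef long long total = 0   # cdef fixes the C type
#       cdef int i
#       for i in range(n):
#           total += i * i
#       return total

print(sum_of_squares(10))  # → 285
```

The function name and logic here are purely illustrative; any hot loop over numeric data benefits similarly from the type declarations.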

Simple, isn’t it?

Once we do that, Cython will convert our Python code to C, as depicted below:

This will run at native machine code speed.

To run it, we invoke the method as we usually would:

The speedup is evident from the image below:

Python code is slow.

But Cython provides over 100x speedup.

Essentially, Python is dynamic in nature.

For instance, we can define a variable of a specific type. But later, we can change it to some other type.

But these dynamic manipulations come at the cost of run time. They also introduce memory overheads.

However, Cython restricts Python’s dynamicity.

More specifically, we avoid the above overheads by explicitly specifying the variable data type.

The above declaration restricts the variable to a specific data type. This means the program would never have to worry about dynamic allocations.

This speeds up run-time and reduces memory overheads.

Isn’t that cool?

**I prepared this notebook for you to get started with Cython: Cython Jupyter Notebook.**

👉 Over to you: Can you tell some more limitations of Python’s default interpreter?


This visual from ByteByteGo neatly summarizes the four standard types of SQL joins:

Here’s a short description:

- **Inner Join**: Returns only the matching rows from both tables.
- **Left Join**: Returns all rows from the left table and the matching rows from the right. Left-table rows without a match get null values in the right table’s columns.
- **Right Join**: Returns all rows from the right table and the matching rows from the left. Right-table rows without a match get null values in the left table’s columns.
- **Full Outer Join**: Returns all rows from both tables, matched where possible. Rows without a match get null values in the other table’s columns.

These four are the most prevalent types of SQL joins.
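To see the contrast between the first two joins in action, here is a small, self-contained sketch using Python’s built-in `sqlite3` module (the `users` and `orders` tables and their data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 3);  -- order 11 has no matching user
""")

# Inner join: only rows that match on user_id survive.
inner = conn.execute("""
    SELECT u.name, o.order_id
    FROM users u JOIN orders o ON u.user_id = o.user_id
""").fetchall()
print(inner)  # → [('Alice', 10)]

# Left join: every user survives; Bob gets NULL (None) for order_id.
left = conn.execute("""
    SELECT u.name, o.order_id
    FROM users u LEFT JOIN orders o ON u.user_id = o.user_id
    ORDER BY u.user_id
""").fetchall()
print(left)  # → [('Alice', 10), ('Bob', None)]
```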

But there are more.

And today, I want to introduce you to three of them, which I find pretty handy at times.

**These are:**

- **Semi Join**
- **Anti Join**
- **Natural Join**

If you already know them, you can stop reading here.

If not, let’s understand!

Semi-join appears quite similar to a left join, but there are three notable differences:

1. If the join condition between two rows is `TRUE`, **columns from only the left table are returned**. Compare this to the left join, which returns columns from both tables.

2. **If a row in the left table has no match, that row is not returned.** In a left join, however, all rows from the left table are returned irrespective of whether they have a match or not.

3. **If a row in the left table has multiple matches, only one entry is returned.** In a left join, however, a row is returned once for every match.

For instance, consider we have the following two tables:

Executing a semi-join, we get the following results:

As depicted above, unlike left-join:

It only returns columns from the left table.

It only returns the matched rows from the left table.

If a record has multiple matches, like in this case:

…then we notice that semi-join only returns one record from the left table:

I find semi-joins particularly useful when I only care about the **existence** of records in another table. A left join would return duplicates, which are not of interest in that case.
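As a self-contained sketch: SQLite (used here so the example runs anywhere) has no `SEMI JOIN` keyword, but the same behavior is typically expressed with `EXISTS`. The tables and data below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);  -- Alice has 2 orders
""")

# Semi-join: users that have at least one order.
# Only left-table columns come back, and Alice appears once
# despite having two matching orders.
semi = conn.execute("""
    SELECT u.user_id, u.name
    FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.user_id)
    ORDER BY u.user_id
""").fetchall()
print(semi)  # → [(1, 'Alice'), (3, 'Carol')]
```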

The rows discarded from the left table by a semi-join are exactly the result of an anti-join.

So, in a way, we can say that:

Consider the above `orders` and `users` tables again (*the version in which there were no multiple matches*):

Executing an anti-join, we get the following results:

It is clear from the semi-join and anti-join results that:

Of course, the order can be different. When I say “`[SEMI JOIN] + [ANTI JOIN] = [LEFT TABLE]`”, I mean the collection of all records.

I find anti-join to be particularly useful when I wish to know which records do not exist in another table.
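An anti-join can be sketched the same way: in SQLite, `NOT EXISTS` keeps exactly the left-table rows that have no match (the tables and data are again made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (10, 1), (12, 3);
""")

# Anti-join: users with NO matching order --
# exactly the left-table rows a semi-join would discard.
anti = conn.execute("""
    SELECT u.user_id, u.name
    FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.user_id)
""").fetchall()
print(anti)  # → [(2, 'Bob')]
```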

Natural join is similar to an INNER JOIN, but there’s no need to specify a join condition explicitly.

Instead, it **automatically considers a join condition on ALL the matching column names**.

Consider the users and orders table yet again:

Here, the `User_ID` column is present in both tables.

Executing a natural join, we get the following results:

As depicted above, the results are similar to what we would get with INNER JOIN.

However, we did not have to explicitly specify a JOIN condition, which, admittedly, could be good or bad.

It is good because it helps us write concise queries.

It is bad because we are not explicit about the columns being joined.
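Here is a minimal sketch of a natural join, which SQLite supports directly; it joins on the shared `user_id` column without an explicit `ON` clause (the tables and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# NATURAL JOIN: no ON clause -- SQLite joins on every column
# name shared by both tables (here, only user_id).
nat = conn.execute("""
    SELECT user_id, name, order_id
    FROM users NATURAL JOIN orders
    ORDER BY order_id
""").fetchall()
print(nat)  # → [(1, 'Alice', 10), (1, 'Alice', 11), (3, 'Carol', 12)]
```

Note how the result matches an inner join on `user_id`, and how Bob, having no orders, is absent.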

These were three more types of SQL Joins, which I use at times to write concise and elegant SQL queries.

Hope you learned something new.

The visual below neatly summarizes what we discussed today:

If you wish to experiment with what we discussed today, download this Jupyter Notebook: **Semi-Anti-Natural Join Notebook**.

👉 Over to you: Have I missed any other type of SQL Join?


I’ve been writing for over 2 years now, first on Medium, then ghostwriting, and eventually, writing this daily newsletter full-time, alongside LinkedIn.

The joy is inexpressible.

I’m happy to share that this newsletter now has more than 63,000 subscribers. I will complete **500 days** of writing this newsletter in a week, and I am super excited about that.

I appreciate your support — by reading it every day, providing feedback, and sharing the newsletter with your friends.

**If you appreciate the quality here, then it’s primarily because of reader feedback.**

Today, I am not releasing an educational issue but instead seeking your help.

Lately, I have realized that I don’t know much about you: what you do, what you look forward to in this newsletter, and what you want me to cover.

I genuinely want to solve your problems.

But in order to do that, I want to know the problems you are currently facing.

So, I would urge you to take a minute or two and respond to the survey below:

Your response will help me ensure I’m creating the best newsletter issues for you in the future.

Thank you so much for giving me a bit of your valuable time.

**You can respond anonymously if you wish to. But if you want me to contact you, please provide your email in the survey.**

Thank you so much!

Your biggest fan,

Avi
