Machine Learning
1. What is overfitting?
Overfitting occurs when a model does not generalize well. The model is too focused on the training set: it captures fine detail, or even noise, that is specific to that set, and so it fails to capture the general trend or the relationships in the data. A model that is too complex relative to the data is likely to overfit.
A strong indicator of overfitting is a large gap between training and test accuracy. Overfit models usually have very high accuracy on the training set, while their test accuracy is much lower and often unstable.
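The gap between training and test accuracy can be illustrated with a small sketch. It is not a real workflow: the data is synthetic (a hypothetical 1-D feature with 20% of the labels flipped), and the model is a 1-nearest-neighbor classifier chosen only because it memorizes the training set, noise included.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Return (x, label) pairs: label is x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:  # label noise (an assumption of this sketch)
            label = 1 - label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Copy the label of the closest training point, noise and all."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    correct = sum(predict_1nn(train, x) == y for x, y in data)
    return correct / len(data)

train_set = make_data(200)
test_set = make_data(200)

train_acc = accuracy(train_set, train_set)
test_acc = accuracy(train_set, test_set)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Every training point is its own nearest neighbor, so the training accuracy is perfect, while the accuracy on unseen points is clearly lower. That difference is the signal to look for.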
2. How can you reduce overfitting?
We can reduce overfitting by making the model generalize better, which means it should focus on the general trend rather than on specific details.
If it is possible, collecting more data is an effective way to reduce overfitting. The model has more material to learn from, which makes it harder to memorize individual examples. Data is always valuable, especially for machine learning models.
Another method to reduce overfitting is to reduce the complexity of the model. If a model is too complex for a given task, it will likely result in overfitting. In such cases, we should look for simpler models.
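As a rough illustration of the effect of model complexity, the sketch below compares two k-nearest-neighbor classifiers on the same synthetic data. The data, the noise level, and the choice of k-NN are all assumptions made for the example. A larger k averages over more neighbors, which gives a smoother and in that sense simpler model.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def make_data(n):
    """Return (x, label) pairs: label is x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:  # label noise (an assumption of this sketch)
            label = 1 - label
        data.append((x, label))
    return data

def predict_knn(train, x, k):
    """Majority vote among the k closest training points (k is odd)."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return int(2 * votes > k)

def accuracy(train, data, k):
    correct = sum(predict_knn(train, x, k) == y for x, y in data)
    return correct / len(data)

train_set = make_data(300)
test_set = make_data(300)

gaps = {}
for k in (1, 25):
    train_acc = accuracy(train_set, train_set, k)
    test_acc = accuracy(train_set, test_set, k)
    gaps[k] = train_acc - test_acc
    print(f"k={k:2d}  train={train_acc:.2f}  test={test_acc:.2f}  gap={gaps[k]:.2f}")
```

With k=1 the model memorizes the noisy labels and the gap is large. With k=25 the training accuracy drops, but it stays close to the test accuracy, which is the behavior we want from a model that generalizes.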
3. What is regularization?
We have mentioned that the main reason for overfitting is a model being more complex than necessary. Regularization is a method for reducing the model complexity.
It does so by penalizing large weights in the model. With a regularization term added to the loss function, training tries to minimize both the loss and the complexity.
The two main types of regularization are L1 and L2. L1 regularization subtracts a small fixed amount from each weight at every iteration. As a result, the weights of uninformative features eventually become exactly zero.
L2 regularization, on the other hand, removes a small percentage of each weight at every iteration. These weights get closer and closer to zero but never actually reach it.
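The difference between the two shrinkage rules can be seen in a minimal sketch. It isolates the regularization step and ignores the gradient of the data loss, which a real training loop would also apply. The step sizes and the helper names `l1_step` and `l2_step` are hypothetical.

```python
def l1_step(w, amount):
    """Move w toward zero by a fixed amount, stopping at exactly zero."""
    if w > amount:
        return w - amount
    if w < -amount:
        return w + amount
    return 0.0

def l2_step(w, fraction):
    """Remove a fixed fraction of the current weight."""
    return w * (1.0 - fraction)

w_l1 = 1.0
w_l2 = 1.0
for _ in range(100):
    w_l1 = l1_step(w_l1, amount=0.05)
    w_l2 = l2_step(w_l2, fraction=0.05)

print(w_l1)  # 0.0
print(w_l2)  # small, but still greater than zero
```

The fixed-amount rule reaches zero after a finite number of steps, whereas the percentage rule only ever shrinks the weight by a factor, so it approaches zero without getting there.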
4. What is the difference between classification and clustering?
Both are machine learning tasks. Classification is a supervised learning task so we have labelled observations (i.e. data points). We train a model with labelled data and expect it to predict the labels of new data.
For instance, spam email detection is a classification task. We provide a model with many emails marked as spam or not spam. After the model is trained on those emails, it can classify new emails as spam or not spam.
Clustering is an unsupervised learning task so the observations do not have any labels. The model is expected to evaluate the observations and group them into clusters. Similar observations are placed into the same cluster.
In the optimal case, the observations in the same cluster are as close to each other as possible and the different clusters are as far apart as possible. An example of a clustering task would be grouping customers based on their shopping behavior.
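The contrast between the two tasks can be sketched on a toy dataset. Everything here is hypothetical: the six spending values, the "low" and "high" labels, and the two simple algorithms (a nearest-centroid classifier and a two-cluster k-means), which stand in for whatever models would be used in practice.

```python
# Classification: labels are given and the model learns to predict them.
labelled = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
            (8.0, "high"), (9.0, "high"), (10.0, "high")]

def fit_centroids(data):
    """Average the feature values of each labelled class."""
    sums, counts = {}, {}
    for x, label in data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(centroids, x):
    """Predict the label whose class average is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

class_centroids = fit_centroids(labelled)
print(classify(class_centroids, 7.0))  # high

# Clustering: no labels are used, the model only groups similar values.
unlabelled = [x for x, _ in labelled]

def two_means(values, iterations=10):
    """Split values into two clusters with a basic k-means loop (k=2)."""
    centers = [min(values), max(values)]  # simple initial guess
    for _ in range(iterations):
        clusters = [[], []]
        for x in values:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

print(two_means(unlabelled))  # [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]]
```

The classifier returns one of the labels it was trained on. The clustering function returns groups without names, and interpreting what each group means is left to us.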
Python
The built-in data structures are of crucial importance, so you should be familiar with what they are and how to interact with them. List, dictionary, set, and tuple are the four main built-in data structures in Python.
5. What is the difference between lists and tuples?
The main difference between lists and tuples is mutability. Lists are mutable so we can manipulate them by adding or removing items.
mylist = [1, 2, 3]
mylist.append(4)
mylist.remove(1)
print(mylist)
[2, 3, 4]
On the other hand, tuples are immutable. Although we can access each element in a tuple, we cannot modify its content.
mytuple = (1, 2, 3)
mytuple.append(4)
AttributeError: 'tuple' object has no attribute 'append'
One important point to mention here is that although tuples are immutable, they can contain mutable elements such as lists or sets.
mytuple = (1, 2, ['a', 'b', 'c'])
mytuple[2]
['a', 'b', 'c']
mytuple[2][0] = ["A"]
print(mytuple)
(1, 2, [['A'], 'b', 'c'])
6. What is the difference between lists and sets?
Let’s use an example to demonstrate the main difference between lists and sets.
text = "Python is awesome!"
mylist = list(text)
myset = set(text)
print(mylist)
['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '!']
print(myset)
{'t', ' ', 'i', 'e', 'm', 'P', '!', 'y', 'o', 'h', 'n', 'a', 's', 'w'}
As the output shows, the list contains all the characters in the string whereas the set only contains the unique values.
Another difference is that the characters in the list are ordered based on their location in the string. However, there is no order associated with the characters in the set.
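The characteristics discussed in the last two questions (mutability, order, and uniqueness) can be checked side by side. This is a small sketch on a hypothetical list of numbers.

```python
items = [3, 1, 3, 2]

as_list = list(items)
as_tuple = tuple(items)
as_set = set(items)

# Lists and tuples keep duplicates; sets keep unique values only.
print(len(as_list), len(as_tuple), len(as_set))  # 4 4 3

# Lists and tuples are ordered, so they support indexing; sets do not.
print(as_list[0], as_tuple[0])  # 3 3
try:
    as_set[0]
except TypeError as error:
    print("set:", error)

# Lists and sets can be changed in place; tuples cannot.
as_list.append(4)
as_set.add(4)
try:
    as_tuple[0] = 99
except TypeError as error:
    print("tuple:", error)
```

Both failing operations raise a TypeError, which is why they are wrapped in try blocks here.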
Here is a table that summarizes the main characteristics of lists, tuples, and sets.