How to Analyze Data through Visualization
This tutorial will help you understand how Data Visualization can help to analyse patterns that may result in a Linear Regression model
What is a plot?
- A visual representation of the data
Which data? How is it usually structured?
- In a table. For example:
import seaborn as sns
df = sns.load_dataset('mpg', index_col='name')
df.head()
How can you Visualice this DataFrame
?
- We could make a point for every car based on
- weight
- mpg
sns.scatterplot(x='weight', y='mpg', data=df);
Which conclusions can you make out of this plot?
Well, you may observe that the location of the points are descending as we move to the right
This means that the
weight
of the car may produce a lower capacity to make kilometresmpg
How can you measure this relationship?
- Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=df[['weight']], y=df.mpg)
model.__dict__
- Resulting in ↓
{'fit_intercept': True,
'normalize': False,
'copy_X': True,
'n_jobs': None,
'n_features_in_': 1,
'coef_': array([-0.00767661]),
'_residues': 7474.8140143821,
'rank_': 1,
'singular_': array([16873.20281508]),
'intercept_': 46.31736442026565}
Which is the mathematical formula for this relationship?
$$mpg = 46.31 - 0.00767 \cdot weight$$
- This equation means that the
mpg
gets 0.00767 units lower for every unit thatweight
increases.
Could you visualise this equation in a plot?
- Absolutely, we could make the predictions from the original data and plot them.
Predictions
y_pred = model.predict(X=df[['weight']])
dfsel = df[['weight', 'mpg']].copy()
dfsel['prediction'] = y_pred
dfsel.head()
weight | mpg | prediction | |
---|---|---|---|
name | |||
chevrolet chevelle malibu | 3504 | 18.0 | 19.418523 |
buick skylark 320 | 3693 | 15.0 | 17.967643 |
plymouth satellite | 3436 | 18.0 | 19.940532 |
amc rebel sst | 3433 | 16.0 | 19.963562 |
ford torino | 3449 | 17.0 | 19.840736 |
Out of this table, you could observe that predictions don't exactly match the reality, but it approximates.
For example, Ford Torino's
mpg
is 17.0, but our model predicts 19.84.
Model Visualization
sns.scatterplot(x='weight', y='mpg', data=dfsel)
sns.scatterplot(x='weight', y='prediction', data=dfsel);
- The blue points represent the actual data.
- The orange points represent the predictions of the model.
I teach Python, R, Statistics & Data Science. I like to produce content that helps people to understand these topics better.
Feel free and welcomed to give me feedback as I would like to make my tutorials clearer and generate content that interests you 🤗
You can see my Tutor Profile here if you need Private Tutoring lessons.