![]() The process of filling in missing data with average data from the rest of the data set is called imputation. On the other hand, the Cabin data is missing enough data that we could probably remove it from our model entirely. The Age column in particular contains a small enough amount of missing that that we can fill in the missing data using some form of mathematics. You can see that the Age and Cabin columns contain the majority of the missing data in the Titanic data set. In this visualization, the white lines indicate missing values in the dataset. Here is the visualization that this generates: In this example, you could create the appropriate seasborn plot with the following Python code: non-survivors exist in our training data.Īn easy way to visualize this is using the seaborn plot countplot. For this specific problem, it's useful to see how many survivors vs. When using machine learning techniques to model classification problems, it is always a good idea to have a sense of the ratio between categories. Learning About Our Data Set With Exploratory Data Analysis The Prevalence of Each Classification Category Next up, we will learn more about our data set by using some basic exploratory data analysis techniques. Embarked: the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton).Fare: how much the passenger paid for their ticket on the Titanic.Parch: the number of parents and children aboard the ship.SibSp: the number of siblings and spouses aboard the ship.Age: the age (in years) of the passenger.This can hold a value of 1, 2, or 3, depending on where the passenger was located in the ship. Pclass: the passenger class of the passenger in question.This variable will hold a value of 1 if they survived and 0 if they did not. Survived: a binary identifier that indicates whether or not the passenger survived the Titanic crash.PassengerId: a numerical identifier for every passenger on the Titanic.Here are brief explanations of each data point: These are the names of the columns in the DataFrame. Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', Here are the imports you will need to run to follow along as I code through our Python logistic regression model: The Imports We Will Be Using in This TutorialĪs before, we will be using multiple open-source software libraries in this tutorial. Once this file has been downloaded, open a Jupyter Notebook in the same working directory and we can begin building our logistic regression model. You can download the data file by clicking the links below: The cleaned Titanic data set has actually already been made available for you. To make things easier for you as a student in this course, we will be using a semi-cleaned version of the Titanic data set, which will save you time on data cleaning and manipulation. The original Titanic data set is publicly available on, which is a website that hosts data sets and data science competitions. ![]() In this tutorial, we will be using the Titanic data set combined with a Python logistic regression model to predict whether or not a passenger survived the Titanic crash. It is often used as an introductory data set for logistic regression problems. The Titanic data set is a very famous data set that contains characteristics about the passengers on the Titanic. ![]() The Data Set We Will Be Using in This Tutorial
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |