Data pre-processing is the manipulation of data before it is used, so that the analysis and models built on it perform well. It is one of the most important steps in a Data Scientist's work. Data pre-processing is generally divided into Data Cleaning, Data Transformation, and Data Reduction.
The data we receive is essentially raw and can contain many irrelevant or missing parts. Data Cleaning deals with these parts: it involves handling missing data, filtering out noisy data, and so on.
Data cleaning and preparation is often the most time-consuming part of a Data Science project. Thankfully, there are many powerful tools to speed up this process. One of them is pandas, a widely used Python library for data analysis. Handling missing data is an essential part of data cleaning because almost all raw, real-world data contains missing values.
Dealing with missing data in Pandas
In pandas, the standard missing values are np.nan, None, and NaT (for datetime64[ns] types). Because np.nan is a float, using it in a column of integers converts that column to a floating-point data type.
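The integer-to-float conversion is easy to see in a small experiment. The sketch below is only an illustration: the variable names and values are made up for this example.

```python
import numpy as np
import pandas as pd

# A column of whole numbers with no missing values stays an integer column.
ints = pd.Series([1, 2, 3], dtype="int64")
print(ints.dtype)

# As soon as np.nan appears, pandas stores the column as floating point,
# because np.nan is itself a float.
with_nan = pd.Series([1, np.nan, 3])
print(with_nan.dtype)

# For datetime columns, a missing entry is stored as NaT instead.
dates = pd.Series(pd.to_datetime(["2021-01-01", None]))
print(dates)
```

This is why a column that should hold counts or IDs can show up as 1.0, 2.0 and so on after loading data that has gaps.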
To clean up missing values, we first have to find them. Pandas provides the isnull() and isna() functions to detect missing values; the two are equivalent. We can also use notna(), which is the exact opposite of isna(). Calling isna().any() on a DataFrame returns a Boolean for each column: if there is at least one missing value in that column, the result is True.
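As a minimal sketch of these detection functions, consider the small DataFrame below. The column names and values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few gaps.
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "city": ["Kochi", "Delhi", None],
    "score": [88.0, 92.5, 79.0],
})

missing_mask = df.isna()         # True where a value is missing (same as df.isnull())
present_mask = df.notna()        # True where a value is present
has_missing = df.isna().any()    # one Boolean per column
missing_counts = df.isna().sum() # number of missing values per column

print(has_missing)
print(missing_counts)
```

Here "age" and "city" each contain one missing value, so has_missing is True for those two columns and False for "score". Counting with isna().sum() is often a useful next step, because it shows how much of each column is missing, not just whether something is.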
Once we have found the missing values, we need to clean them up. The problem is that not all missing values come in the tidy form of np.nan or None. Sometimes characters such as “?” or “- -” are used to represent missing data points.
However, pandas does not recognize these characters as missing data. In such situations, we can use the replace() function to convert them to np.nan, so that they are treated like any other missing value.
There is no single best way to handle missing values. Depending on the characteristics of the data and the task at hand, we can either drop the missing values or replace them.
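The following sketch shows one way replace() might be used for this. The placeholder strings "?" and "--" and the sample columns are assumptions chosen for illustration; real datasets may use other markers.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which missing entries were typed in as "?" or "--".
df = pd.DataFrame({
    "age": ["25", "?", "31"],
    "city": ["Kochi", "--", "Delhi"],
})

# pandas sees the placeholders as ordinary strings, so nothing is flagged.
print(df.isna().sum())

# Convert the placeholders to np.nan so that isna(), dropna() and fillna()
# can work with them.
cleaned = df.replace(["?", "--"], np.nan)
print(cleaned.isna().sum())
```

If the data is being loaded from a CSV file, the na_values argument of read_csv can perform the same conversion while reading, for example na_values=["?", "--"].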
To drop rows or columns with missing values, we can use the dropna() function. The “how” parameter sets the condition for dropping: “any” drops a row or column if it contains any missing value, and “all” drops it only if all of its values are missing.
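The difference between the two settings of "how" can be sketched as follows. The DataFrame is made up for this example; axis=1 is the dropna() option that switches from dropping rows to dropping columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 3 are partly missing,
# and row 2 is entirely missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [2.0, 5.0, np.nan, 6.0],
    "c": [3.0, np.nan, np.nan, np.nan],
})

# Drop a row if it has at least one missing value: only row 0 survives.
rows_any = df.dropna(how="any")

# Drop a row only if every value is missing: only row 2 is removed.
rows_all = df.dropna(how="all")

# Work on columns instead of rows: after removing the empty row,
# keep only the columns that have no missing values left.
cols_any = rows_all.dropna(axis=1, how="any")

print(rows_any)
print(rows_all)
print(cols_any)
```

Note that dropna() returns a new DataFrame and leaves the original unchanged, so the result has to be assigned to a variable.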
Data is valuable, so we should not drop data points on a whim. Machine learning models tend to perform better with more training data. So, depending on the situation, we can choose to replace missing values instead of dropping them.
The fillna() function in pandas handles missing data by replacing it with a specified value or with an aggregate value such as the mean or median.
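A minimal sketch of both kinds of replacement is shown below. The columns, the values, and the "Unknown" label are assumptions made for this example; which fill value is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one missing label.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Kochi", None, "Delhi"],
})

filled = df.copy()

# Numeric column: fill the gap with an aggregate of the observed values.
# mean() skips missing values by default; median() could be used the same way.
filled["age"] = filled["age"].fillna(filled["age"].mean())

# Text column: fill the gap with a fixed placeholder value.
filled["city"] = filled["city"].fillna("Unknown")

print(filled)
```

The median is usually the safer choice when a column contains outliers, since a few extreme values can pull the mean away from what is typical.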
A deep understanding of data pre-processing, and of data cleaning in particular, is very important for a Data Scientist, and suitable Data Science training in Kerala can make a huge difference. The right kind of training in working with data matters because converting raw, real-life data into processed data is the key to Data Analysis. Such training is available through the extensive course offered by the Data Science training institute in Kochi.