Data Manipulation & Cleaning - Part 2

Part 2 - Data Wrangling & Preprocessing

1. Where does data come from?

  • Data can come from many sources, such as proprietary data sources, government data sets, web searches, academic data sets, sensor data, crowdsourcing, and data sets that researchers create themselves.

2. If there is bad data, what happens then?

  • Bad data produces bad results, and the damage carries through the whole process to the final output.
  • It can lead to incorrect analysis, invalid insights, poor outcomes, and wrong decisions.

3. What Is Data Wrangling?

  • Data Wrangling (Data Munging) is the process of converting “raw” data into data that can be explored and analyzed to generate valid actionable insights.

4. What are the common problems with data?

  • Missing values
  • Outliers
  • Duplicates
  • Untidy data

5. Dealing with missing values

  • Real-world data often has a lot of missing values. The cause can be data corruption or a failure to record the data. Handling missing data is a very important part of preprocessing a dataset.
  • There are several ways to handle missing values in a dataset:

▹Deleting Rows with missing values

▹Impute missing values for continuous variables

▹Impute missing values for categorical variables

▹Other Imputation Methods

▹Using Algorithms that support missing values

▹Prediction or filling of missing values
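The first three options above can be sketched with pandas. This is a minimal illustration on a made-up table, not a recommendation for any particular dataset: the column names `age` and `city` are hypothetical, and the median and the mode are just two common imputation choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Colombo", "Kandy", None, "Colombo"],
})

# Count the missing values in each column.
missing_counts = df.isna().sum()

# Option 1: delete every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: impute a continuous variable with its median.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Option 3: impute a categorical variable with its most frequent value (mode).
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(missing_counts)
print(dropped)
print(imputed)
```

Deleting rows is the simplest option but throws information away, so imputation is usually worth considering when many rows have only one or two gaps.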

6. Dealing with duplicate values

  • First of all, check whether you have duplicate records. If you don't, you can skip this step. A row counts as a duplicate when the whole row appears elsewhere with the same values in every column.
  • If duplicates exist, removing (dropping) the duplicate rows solves the problem.
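As a rough sketch, assuming the data lives in a pandas DataFrame, the check and the removal might look like this. The table and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataset with one fully repeated row.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Ann", "Ben", "Ben", "Cal"],
})

# True for each row that repeats an earlier row in every column.
dup_mask = df.duplicated()
n_duplicates = int(dup_mask.sum())

# Keep the first occurrence of each row and drop the repeats.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

Both `duplicated` and `drop_duplicates` accept a `subset` argument if only some columns should decide whether two rows count as the same record.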

7. Dealing with outliers

  • An outlier is any data point that is distinctly different from the rest of your data points. There are four main approaches to dealing with outliers:

★ Drop the outlier records: first, detect the outliers by calculating the quartiles and the IQR (interquartile range), then remove the values that lie more than 1.5 times the IQR below the first quartile or above the third quartile.

★ Cap your outliers data

★ Assign a new value

★ Try a transformation
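The four approaches can be sketched on a single numeric column. This is only an illustration under a few assumptions: the sample values are made up, the replacement value is taken to be the median, and the transformation is taken to be a log transform. Other choices are equally valid.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0], name="value")

# Detect outliers with the 1.5 * IQR rule.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# 1. Drop the outlier records.
dropped = values[~is_outlier]

# 2. Cap the outliers at the IQR fences.
capped = values.clip(lower=lower, upper=upper)

# 3. Assign a new value (here the median, as an assumed choice).
replaced = values.mask(is_outlier, values.median())

# 4. Try a transformation (here log(1 + x), as an assumed choice).
transformed = np.log1p(values)

print(is_outlier.sum(), len(dropped), capped.max(), replaced.iloc[-1])
```

Dropping and capping both depend on the fences `lower` and `upper`, so it helps to look at how many rows they flag before committing to either.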

8. Dealing with untidy data

  • Data scientists are often said to spend about 80% of their time cleaning and organizing data. Data manipulation is how untidy data gets turned into tidy data.

Tidy data is very useful for everyone who works with it.

In tidy data:

▹Each variable must have its own column.

▹Each observation must have its own row.

▹Each type of observational unit forms a table.
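A small sketch of what this means in practice, assuming pandas and a made-up "wide" table in which the values of a year variable are spread across the column headers. The names `country`, `year`, and `cases` are hypothetical.

```python
import pandas as pd

# Untidy: the years are values of a variable, but they appear as column names.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [100, 200],
    "2020": [110, 210],
})

# Tidy: one column per variable, one row per (country, year) observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

After reshaping, each row describes exactly one observation, which makes filtering, grouping, and plotting by year straightforward.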

9. What is Data Preprocessing?

Data preprocessing is a process that should be carried out before starting analysis or model fitting. Some popular preprocessing steps are:

  • Data Cleaning (Missing Values, Duplicates, Outliers)
  • Data Manipulation
  • Feature Engineering
  • Dummy Conversion (avoiding the dummy variable trap)
  • Sliding Window for Dates
  • Standardizing
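Two of these steps, dummy conversion and standardizing, can be sketched with pandas. This assumes that the dummy-conversion item refers to one-hot encoding a categorical column while avoiding the dummy variable trap. The table and the column names `colour` and `height` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataset with one categorical and one numeric column.
df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],
    "height": [150.0, 160.0, 170.0, 180.0],
})

# Dummy conversion: drop_first=True drops one level so the remaining
# dummy columns are not perfectly collinear (the dummy variable trap).
dummies = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)

# Standardizing: rescale to zero mean and unit (sample) standard deviation.
height_std = (df["height"] - df["height"].mean()) / df["height"].std()

processed = pd.concat([dummies, height_std.rename("height_std")], axis=1)

print(processed)
```

When the data is split into training and test sets, the mean and standard deviation would normally be computed on the training set only and then reused for the test set.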

Final Year Undergraduate, University of Moratuwa.