Best Software Development Company in Pune

Sunday, May 3, 2020

An Insight into Data Preprocessing - By Aakash Rodhe


Data preprocessing is a data mining technique that transforms raw data
into an understandable format. Real-world data is often incomplete, inconsistent,
or lacking in certain behaviors or trends, and is likely to contain many errors.
Preprocessing is therefore one of the most important tasks to perform before building
a Machine Learning model or applying business intelligence
to the data.


Data preprocessing is used across many industries, for example in weather forecasting, banking
and insurance, pharmaceuticals and drug design, predicting capital market behavior,
understanding customers, and designing robots and self-driving cars.


The steps involved in data preprocessing are:
  • Missing Value Analysis
  • Outlier Analysis
  • Feature Scaling 
  • Feature Selection
  • Sampling 


Missing Value Analysis


Missing value analysis helps address several concerns caused by incomplete data.
If cases with missing values are systematically different from cases without missing values,
the results can be misleading. Also, missing data may reduce the precision of calculated
statistics because there is less information than originally planned. Another concern is that
the assumptions behind many statistical procedures are based on complete cases, and
missing values can complicate the theory required.


There are multiple methods for filling in missing values: central
statistics (mean, median, mode), distance-based methods (K Nearest Neighbors), and prediction-based methods.


Sometimes the rows with missing values are simply removed instead, when the percentage of missing values is very small.
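The filling strategies above can be sketched in a few lines of plain Python. This is only a minimal illustration using the standard library: the helper names (fill_missing, knn_fill, drop_missing), the use of None to mark a missing value, and the sample data are all assumptions made for this example, not part of any particular library.

```python
import math
from statistics import mean, median, mode

def fill_missing(values, strategy="mean"):
    """Replace None entries with a central statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    stat = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [stat if v is None else v for v in values]

def knn_fill(rows, col, k=2):
    """Fill None in column `col` with the mean of that column over the k
    nearest complete rows (Euclidean distance on the other columns).
    Assumes the other columns have no missing values."""
    complete = [r for r in rows if r[col] is not None]
    filled = []
    for r in rows:
        if r[col] is not None:
            filled.append(r)
            continue

        def dist(other, r=r):
            # Distance ignores the column being imputed.
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(r, other))
                                 if i != col))

        nearest = sorted(complete, key=dist)[:k]
        value = mean(n[col] for n in nearest)
        filled.append(r[:col] + (value,) + r[col + 1:])
    return filled

def drop_missing(rows):
    """Remove rows that contain at least one None."""
    return [row for row in rows if None not in row]

ages = [25, None, 31, 25, None, 40]          # hypothetical column
print(fill_missing(ages, "mean"))            # gaps become 30.25
print(fill_missing(ages, "median"))          # gaps become 28.0
print(fill_missing(ages, "mode"))            # gaps become 25

usage = [(1.0, 10.0), (2.0, 20.0), (3.0, None), (10.0, 100.0)]
print(knn_fill(usage, col=1, k=2))           # (3.0, None) becomes (3.0, 15.0)
print(drop_missing(usage))                   # the incomplete row is removed
```

Which strategy is appropriate depends on the data: the median is less sensitive to extreme values than the mean, and the mode also works for categorical columns.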


Outlier Analysis


Outliers are extreme values that deviate from the other observations in the data. They may indicate
variability in a measurement, experimental error, or a novelty. In other words, an outlier is
an observation that diverges from the overall pattern of a sample.


Causes of outliers include poor data quality, low-quality measurements, malfunctioning
equipment, manual error, and correct but exceptional data.


A box plot is one graphical method for finding outliers. By the usual box plot convention, values that lie more than
1.5 times the interquartile range (IQR) above the upper quartile or below the lower quartile are treated as
outliers.
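The box plot rule can also be applied numerically, without drawing the plot. The following is a small sketch, assuming the common 1.5 x IQR convention; the function name iqr_outliers and the sample readings are made up for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [12, 13, 12, 14, 13, 15, 14, 13, 98]   # hypothetical sensor data
print(iqr_outliers(readings))                      # [98]
```

Here Q1 is 13 and Q3 is 14, so the fences are 11.5 and 15.5, and only the reading 98 falls outside them.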


Feature Selection


Feature selection means selecting a subset of relevant variables (predictors) for use in model construction: the subset of
a learning algorithm's input variables on which it should focus attention, while ignoring the
rest. It also reduces dimensionality.


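Feature selection can be done using the following statistical techniques and algorithms:
Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Random Forest,
Correlation Analysis, Chi-square Test, and ANOVA. Note that PCA and SVD reduce dimensionality by building new combined features rather than by picking a subset of the original ones.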
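As one concrete illustration, correlation analysis can be used as a simple filter: compute the Pearson correlation between each feature and the target, and keep only the features whose absolute correlation reaches a chosen threshold. The sketch below assumes numeric features; the helper names, the 0.5 threshold, and the small churn-style data set are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)

def select_by_correlation(features, target, threshold=0.5):
    """Keep the names of features whose |r| with the target is >= threshold."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return [name for name, r in scores.items() if abs(r) >= threshold]

# Hypothetical data: three candidate features and a churn score to predict.
features = {
    "monthly_charge": [20, 35, 50, 65, 80],
    "support_calls": [1, 2, 2, 4, 5],
    "customer_id": [3, 1, 5, 2, 4],
}
churn_score = [0.1, 0.2, 0.4, 0.7, 0.9]

print(select_by_correlation(features, churn_score))
# ['monthly_charge', 'support_calls']
```

A filter like this only detects linear relationships with the target, which is one reason tree-based importance (Random Forest) or tests such as Chi-square and ANOVA are often used alongside it.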


Feature Scaling


Feature scaling is a method of bringing all continuous variables into a particular range using
normalization or standardization. Before scaling, we also bring all values of a variable to the same unit.
For example, if one of the variables is time, with some values recorded in hours and some in
minutes, we first convert all values to a single unit and then perform feature
scaling.
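The time example can be sketched as follows: first convert every value to minutes, then apply min-max normalization, (x - min) / (max - min), or standardization, (x - mean) / standard deviation. The function names and the sample durations are assumptions chosen for this illustration.

```python
from statistics import mean, pstdev

def to_minutes(value, unit):
    """Convert a duration to minutes so all values share one unit."""
    factors = {"minutes": 1, "hours": 60}
    return value * factors[unit]

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical durations recorded in mixed units.
raw = [(1, "hours"), (30, "minutes"), (2, "hours"), (90, "minutes")]
minutes = [to_minutes(v, u) for v, u in raw]   # [60, 30, 120, 90]

print(min_max_normalize(minutes))   # smallest value maps to 0.0, largest to 1.0
print(standardize(minutes))         # values centered on 0
```

Normalization is a natural choice when a bounded range is needed, while standardization is less distorted by a single very large or very small value.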


Sampling


This is the last part of data preprocessing, where we draw samples from the whole preprocessed data set.
A well-chosen sample lets us observe the characteristics of the full data.


Types of sampling methods:
  • Probability Sampling
    • Simple Random
    • Systematic Random
    • Stratified
    • Multi-Stage Cluster
  • Non-Probability Sampling
    • Convenience
    • Snowball
    • Theoretical
    • Quota

Through sampling we can split the data into training and test sets.
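
A simple random split can be written in a few lines. The sketch below is a hypothetical helper (not the scikit-learn function of a similar name), assuming a 20% test share and a fixed seed for repeatability.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test) lists."""
    shuffled = rows[:]                       # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))                    # stand-in for preprocessed rows
train, test = split_train_test(records, test_fraction=0.2)
print(len(train), len(test))                 # 8 2
```

The model is fitted on the training set only, and the test set is held back to check how well it generalizes to unseen data.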

One of the best use cases of data preprocessing in
telecommunications is churn reduction: the data is
preprocessed before drawing conclusions about why
customers are leaving their current telecom provider.
