
04_1 Data Preprocessing

noweahct 2025. 1. 22. 15:27

1) Handling Missing Data

Missing Data (missing values)

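  • Measured data can have missing values for many reasons ➢ e.g., a respondent did not answer, the measurement was impossible, an instrument malfunctioned, or the technology was limited
  • If missing values are not handled before analysis, computing statistics such as the mean and variance, or building a model later, can produce errors ➢ Some packages have a built-in default behavior (e.g., pandas excludes missing values from its calculations)
  • Missing values can be handled in various ways
    • Removal or deletion of missing values
    • Imputation of missing values with the mean/median/mode
    • Inference (Regression / Classification)
    • sklearn library / deep learning library
    • Domain specific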

Datapoint or Feature deletion

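  • Delete the attributes (columns) that contain missing values, or
  • Delete the samples (rows) that contain missing values
  • This is the easiest and most convenient approach, but when many values are missing, most of the data can be lost
    • This is hard to accept when data is expensive to obtain (e.g., biological experiment data, hospital patient data)
    • Statistical analysis generally benefits from more data, but deletion reduces the amount of data available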
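
As a rough illustration of how much data deletion can cost, here is a small sketch using pandas `dropna`. The DataFrame and its column names are hypothetical; only 2 of 12 cells are missing, yet row deletion removes half of the samples.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with two missing cells.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "height": [170.0, 165.0, np.nan, 180.0],
    "weight": [60.0, 55.0, 72.0, 80.0],
})

# Delete every sample (row) that contains a missing value.
rows_dropped = df.dropna(axis=0)

# Delete every attribute (column) that contains a missing value.
cols_dropped = df.dropna(axis=1)

# A softer option: keep rows that have at least 2 non-missing values.
mostly_complete = df.dropna(thresh=2)

print(rows_dropped.shape, cols_dropped.shape, mostly_complete.shape)
```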

Imputing missing values

  • Imputation is a method of filling missing values with numbers using a specific strategy
    • mean / median / mode
    • distinct value such as 0 or -1
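    • random selection from existing values
      -> The three strategies above are simple, but they ignore the possibility of a better estimate based on relationships with other attributes
    • value predicted by a predictive model (it is recommended to revisit this after studying the AI material)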
    • KNN / Naïve Bayes
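
The simple strategies in the list above can be sketched with pandas `fillna`. This is an illustrative example only: the series `s` is made up, and the random-selection variant uses a fixed seed so the run is repeatable. Model-based imputation (KNN, Naïve Bayes) is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric attribute with two missing values.
s = pd.Series([1.0, 2.0, np.nan, 2.0, 10.0, np.nan])

# Mean / median / mode of the observed values [1, 2, 2, 10].
filled_mean = s.fillna(s.mean())            # fills with 3.75
filled_median = s.fillna(s.median())        # fills with 2.0
filled_mode = s.fillna(s.mode().iloc[0])    # fills with 2.0

# A distinct sentinel value such as -1.
filled_const = s.fillna(-1)

# Random selection from the existing (observed) values.
rng = np.random.default_rng(0)
observed = s.dropna().to_numpy()
missing = s.isna()
filled_random = s.copy()
filled_random.loc[missing] = rng.choice(observed, size=int(missing.sum()))

print(filled_mean.tolist())
print(filled_random.tolist())
```

Note how the mean (3.75) is pulled upward by the single large value 10, while the median and mode are not; which one is appropriate depends on the distribution of the attribute.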

Replace missing values

  • for numerical values, replace the missing value with the average value of the column
  • for categorical values, replace the missing value with the most frequent value of the column
  • use other functions
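
One possible way to apply these rules column by column is sketched below. The DataFrame, its column names, and its values are hypothetical, and the dtype check via `pd.api.types.is_numeric_dtype` is one design choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one numerical and one categorical column.
df = pd.DataFrame({
    "income": [3000.0, np.nan, 4000.0, 5000.0],
    "city": ["Seoul", "Busan", None, "Seoul"],
})

filled = df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numerical column: use the column average.
        filled[col] = filled[col].fillna(filled[col].mean())
    else:
        # Categorical column: use the most frequent value.
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```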

2) Data Transformation

  • Data transformation is a data preprocessing technique used to reorganize or restructure the raw data in such a way that data mining can retrieve strategic information efficiently and easily.
    • Removing Duplicates
    • Transforming Data Using a Function or Mapping
    • Replacing Values
    • Renaming Axis Indexes
    • Discretization and Binning
    • Detecting and Filtering Outliers
    • Permutation and Random Sampling
    • Computing Indicator/Dummy Variables
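
Several of the operations listed above map onto single pandas calls. The following sketch uses a made-up `df` and a hypothetical `food_to_source` mapping, and treats -999 as an assumed sentinel for "unknown"; it is meant as a quick tour rather than a recommended pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated row and one sentinel value (-999).
df = pd.DataFrame({
    "food": ["bacon", "bacon", "salmon", "tofu"],
    "grams": [100, 100, -999, 250],
})

# Removing duplicates.
deduped = df.drop_duplicates()

# Transforming data using a mapping.
food_to_source = {"bacon": "pig", "salmon": "fish", "tofu": "soy"}
mapped = deduped.assign(source=deduped["food"].map(food_to_source))

# Replacing values: turn the sentinel into a proper missing value.
replaced = mapped.assign(grams=mapped["grams"].replace(-999, np.nan))

# Renaming axis indexes (here, a column label).
renamed = replaced.rename(columns={"grams": "weight_g"})

# Computing indicator/dummy variables.
dummies = pd.get_dummies(renamed["food"])

# Random sampling without replacement (fixed seed for repeatability).
sampled = renamed.sample(n=2, random_state=0)

print(renamed)
print(dummies)
```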

Discretization and Binning

  • Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets.
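
The age-bucket scenario can be sketched with `pd.cut`. The ages, bin edges, and labels below are illustrative assumptions, not values from the course material. By default each bin is open on the left and closed on the right, so 25 falls into (18, 25].

```python
import pandas as pd

# Hypothetical ages of study participants.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Bin edges give the intervals (18, 25], (25, 35], (35, 60], (60, 100].
bins = [18, 25, 35, 60, 100]
labels = ["youth", "young adult", "middle aged", "senior"]

age_groups = pd.cut(ages, bins=bins, labels=labels)

# Number of people per bucket, in bucket order.
counts = pd.Series(age_groups).value_counts().sort_index()
print(counts)
```

When equal-sized buckets matter more than fixed edges, `pd.qcut` bins by sample quantiles instead.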

Data Aggregation

  • Data collection or aggregation is the method of storing and presenting data in a summary format.
    • The data may be gathered from multiple data sources and integrated into a single description for analysis.
    • This is a crucial step since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used.
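
A minimal sketch of this idea, assuming two hypothetical sources (`store_a`, `store_b`) that share the same columns: the sources are first integrated with `pd.concat`, then summarized per group with `groupby` and `agg`.

```python
import pandas as pd

# Two hypothetical data sources with the same schema.
store_a = pd.DataFrame({"region": ["east", "west"], "sales": [100, 200]})
store_b = pd.DataFrame({"region": ["east", "west", "west"], "sales": [50, 25, 75]})

# Integrate the sources into one table.
combined = pd.concat([store_a, store_b], ignore_index=True)

# Present the data in summary form: one row per region.
summary = combined.groupby("region")["sales"].agg(["count", "sum", "mean"])
print(summary)
```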

Data Normalization

  • Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0, 1.0].
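
One common way to do this is min-max scaling, sketched below with NumPy. The helper name `min_max_scale` and the sample values are assumptions for illustration; other normalization schemes exist and are not covered here.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:
        raise ValueError("cannot scale a constant array")
    unit = (values - old_min) / (old_max - old_min)  # now in [0, 1]
    return unit * (new_max - new_min) + new_min

x = [10, 20, 30, 50]
scaled_01 = min_max_scale(x)              # range [0.0, 1.0]
scaled_pm1 = min_max_scale(x, -1.0, 1.0)  # range [-1, 1]
print(scaled_01, scaled_pm1)
```

Since the minimum and maximum define the scale, a single extreme value can squeeze all other points into a narrow band, which is one reason outliers are usually examined before normalizing.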

Detecting and Filtering Outliers

  • What is an outlier?
    • An observation in a given dataset that lies far from the rest of the observations, which means an outlier is vastly larger or smaller than the remaining values in the set.
  • Why do they occur?
    • due to the variability in the data, or due to experimental error/human error.
    • error vs novelty
  • How to detect outliers?
    • Visual inspection
    • Z-score
    • Interquartile Range (IQR)
    • ML algorithms
  • Visual inspection
    • Look at descriptive statistics
    • Use graphical diagnostic tools, e.g., the boxplot
  • Z-scores
    • Compute the Z-score of each data point using the formula (Xi - mean) / std
    • Define a threshold value of 3 and mark the data points whose absolute Z-score is greater than the threshold as outliers
  • Interquartile Range (IQR)
    • Sort the dataset in ascending order
    • Calculate the 1st and 3rd quartiles (Q1, Q3)
    • Compute IQR = Q3 - Q1
    • Compute lower bound = Q1 - 1.5 × IQR, upper bound = Q3 + 1.5 × IQR
    • Mark the data points that fall below the lower bound or above the upper bound as outliers
  • Handling Outliers
    • Trimming/removing the outlier: remove the outliers from the dataset
    • Quantile-based flooring and capping: values above a high percentile (e.g., the 90th) are capped at that percentile value, and values below a low percentile (e.g., the 10th) are floored at that percentile value
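
The Z-score and IQR procedures, together with the two handling options, can be sketched with NumPy as follows. The `values` array is made up (19 typical readings plus one extreme value), `np.std` uses the population standard deviation by default, and `np.percentile` uses linear interpolation by default; other conventions give slightly different bounds.

```python
import numpy as np

# Hypothetical readings: 19 typical values and one extreme value (100).
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                   12, 10, 11, 13, 12, 11, 10, 12, 11, 100], dtype=float)

# --- Z-score detection: |z| > 3 marks an outlier ---
z = (values - values.mean()) / values.std()
z_mask = np.abs(z) > 3
z_outliers = values[z_mask]

# --- IQR detection ---
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_mask = (values < lower) | (values > upper)
iqr_outliers = values[iqr_mask]

# --- Handling, option 1: trimming (drop the flagged points) ---
trimmed = values[~iqr_mask]

# --- Handling, option 2: flooring and capping at the 10th/90th percentiles ---
p10, p90 = np.percentile(values, [10, 90])
capped = np.clip(values, p10, p90)

print(z_outliers, iqr_outliers, len(trimmed), capped.min(), capped.max())
```

On small samples the Z-score rule can miss outliers, because the extreme value itself inflates the mean and standard deviation; the IQR rule is less sensitive to this since quartiles barely move.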

 
