04_1 Data Preprocessing

noweahct 2025. 1. 22. 15:27

1) Handling Missing Data

Missing Data

Measured values can be missing for many reasons.
- e.g., a respondent did not answer, the measurement was not possible, the instrument failed, or the technology has limits
If missing values are not handled before analysis, computing statistics such as the mean and variance, or building a model later, can produce errors.
- Some packages have a built-in default behavior (e.g., pandas excludes missing values when computing)
Missing values can be handled in several ways:
- Removal or deletion of missing values
- Imputing missing values with the mean/median/mode
- Inference (regression / classification)
- sklearn library / deep learning library
- Domain-specific methods
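
Before choosing a method it helps to see how many values are missing and how the library treats them. A minimal sketch, assuming pandas and NumPy are installed; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing measurement.
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "weight": [65.0, 72.0, np.nan, np.nan],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# pandas skips NaN by default (skipna=True) when computing the mean ...
mean_skipping_nan = df["height"].mean()

# ... while a plain NumPy mean propagates NaN into the result.
mean_with_nan = np.mean(df["height"].to_numpy())

print(missing_per_column)
print(mean_skipping_nan, mean_with_nan)
```

The same data gives a number in one case and NaN in the other, so it is worth checking the default of whichever package is used.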

Datapoint or Feature deletion

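Delete the attribute (column) that contains missing values, or
delete the sample (row) that contains missing values.
This is the simplest and most convenient method, but when many values are missing, most of the data can disappear.
- Deletion is hard to justify when data is expensive (e.g., biological experiment data, hospital patient data)
- For later statistical analysis more data is generally better, yet deletion reduces the amount of data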
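
The two deletion options can be sketched with pandas `dropna`. The data below is hypothetical and only meant to show how quickly rows or columns disappear:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in two columns.
df = pd.DataFrame({
    "age": [25.0, 31.0, np.nan, 40.0],
    "income": [3000.0, np.nan, np.nan, 5200.0],
    "city": ["Seoul", "Busan", "Seoul", "Daegu"],
})

# Delete samples (rows) that contain any missing value.
rows_dropped = df.dropna(axis=0)

# Delete attributes (columns) that contain any missing value.
cols_dropped = df.dropna(axis=1)

# Softer variant: keep rows that have at least 2 non-missing values.
rows_thresh = df.dropna(thresh=2)

print(rows_dropped.shape, cols_dropped.shape, rows_thresh.shape)
```

Here row deletion keeps only half of the samples and column deletion keeps only one attribute, which illustrates the data-loss concern noted above.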

Imputing missing value

Imputation is a method of filling in missing values using a specific strategy.
- mean / median / mode
- a distinct value such as 0 or -1
- random selection from the existing values
  - The three strategies above are simple, but they ignore the possibility of a better estimate based on the relationship with other attributes
- a value predicted by a predictive model (worth revisiting after studying the AI material)
  - KNN / Naïve Bayes
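
As one possible illustration of model-based imputation, the sketch below compares a column mean with scikit-learn's `KNNImputer`. It assumes scikit-learn is installed; the array and the choice of `n_neighbors=2` are arbitrary and only for demonstration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the second column is twice the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],  # the value to impute
    [4.0, 8.0],
])

# Baseline: the column mean ignores the first column entirely.
column_mean = np.nanmean(X[:, 1])

# KNN imputation: average the second column over the 2 rows that are
# closest to the incomplete row, measured on the observed features.
imputer = KNNImputer(n_neighbors=2)
X_knn = imputer.fit_transform(X)

print(column_mean)
print(X_knn[2, 1])
```

Because the neighbors are chosen using the other attribute, the KNN estimate follows the relationship between the columns, which the simple strategies cannot do.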

Replace missing values

For numerical values, replace the missing value with the average value of the column.
For categorical values, replace the missing value with the most frequent value of the column.
Other functions (e.g., the median) can be used as well.
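
A minimal sketch of these replacement rules with pandas `fillna`, assuming a hypothetical table with one numerical and one categorical column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column.
df = pd.DataFrame({
    "score": [80.0, np.nan, 90.0, 70.0],
    "grade": ["B", "A", None, "B"],
})

filled = df.copy()

# Numerical column: fill with the column average.
filled["score"] = filled["score"].fillna(filled["score"].mean())

# Categorical column: fill with the most frequent value.
# mode() returns a Series (there can be ties), so take the first entry.
filled["grade"] = filled["grade"].fillna(filled["grade"].mode()[0])

print(filled)
```

Swapping `mean()` for `median()` gives the median variant, which is less sensitive to extreme values.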

2) Data Transformation

Data transformation is a data preprocessing technique used to reorganize or restructure the raw data in such a way that data mining can retrieve strategic information efficiently and easily.
- Removing Duplicates
- Transforming Data Using a Function or Mapping
- Replacing Values
- Renaming Axis Indexes
- Discretization and Binning
- Detecting and Filtering Outliers
- Permutation and Random Sampling
- Computing Indicator/Dummy Variables
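
Several of the operations listed above map onto single pandas calls. The following is a rough sketch on made-up data; the mapping dictionary and the -999 sentinel value are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a duplicated row and a sentinel value (-999).
df = pd.DataFrame({
    "food": ["bacon", "bacon", "salmon", "bacon"],
    "ounces": [4.0, 4.0, -999.0, 12.0],
})

# Removing duplicates: keep the first of each identical row.
deduped = df.drop_duplicates().reset_index(drop=True)

# Transforming data using a mapping: derive a new column from a dict.
food_to_animal = {"bacon": "pig", "salmon": "fish"}
deduped["animal"] = deduped["food"].map(food_to_animal)

# Replacing values: turn the sentinel into a proper missing value.
deduped["ounces"] = deduped["ounces"].replace(-999.0, np.nan)

# Renaming axis indexes: apply a function to every column label.
renamed = deduped.rename(columns=str.upper)

# Computing indicator/dummy variables: one 0/1 column per category.
dummies = pd.get_dummies(deduped["animal"], prefix="animal")

print(renamed)
print(dummies)
```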

Discretization and Binning

Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets.
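
The age-bucket idea can be sketched with `pandas.cut`. The ages, bin edges, and labels below are hypothetical; by default each bin is open on the left and closed on the right, so 25 falls in the first bucket:

```python
import pandas as pd

# Hypothetical ages of the people in the study.
ages = [19, 23, 25, 31, 38, 47, 62, 71]

# Bin edges define the intervals (18, 25], (25, 35], (35, 60], (60, 100].
bins = [18, 25, 35, 60, 100]
labels = ["18-25", "26-35", "36-60", "61-100"]

# Assign each age to its bucket.
age_groups = pd.cut(ages, bins=bins, labels=labels)

# Count how many people fall in each bucket.
counts = pd.Series(age_groups).value_counts()

print(list(age_groups))
print(counts)
```

`pandas.qcut` is the quantile-based counterpart when buckets with roughly equal counts are preferred over fixed edges.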

Data Aggregation

Data collection or aggregation is the method of storing and presenting data in a summary format.
- The data may be obtained from multiple data sources and integrated into a single description for analysis.
- This is a crucial step, since the accuracy of the insights from data analysis depends heavily on the quantity and quality of the data used.
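
As a small illustration, two hypothetical sources can be combined and then summarized per group with pandas. The table and column names are assumptions, not part of the course material:

```python
import pandas as pd

# Two hypothetical data sources with the same layout.
store_a = pd.DataFrame({"region": ["east", "west", "east"],
                        "sales": [100, 150, 200]})
store_b = pd.DataFrame({"region": ["west", "east"],
                        "sales": [50, 300]})

# Integrate the sources into one table.
combined = pd.concat([store_a, store_b], ignore_index=True)

# Present the data in summary form: count, total, and mean per region.
summary = combined.groupby("region")["sales"].agg(["count", "sum", "mean"])

print(summary)
```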

Data Normalization

Normalizing the data refers to scaling the data values into a much smaller range, such as [-1, 1] or [0.0, 1.0].
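
One common way to do this is min-max scaling. The helper below is a hypothetical sketch written with NumPy, not a library function, and it assumes the values are not all identical:

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max].

    Assumes values.max() > values.min(); a constant array would
    cause a division by zero and needs separate handling.
    """
    old_min, old_max = values.min(), values.max()
    unit = (values - old_min) / (old_max - old_min)  # now in [0, 1]
    return unit * (new_max - new_min) + new_min

# Hypothetical measurements.
x = np.array([10.0, 20.0, 30.0, 50.0])

scaled_01 = min_max_scale(x)              # range [0.0, 1.0]
scaled_pm1 = min_max_scale(x, -1.0, 1.0)  # range [-1, 1]

print(scaled_01)
print(scaled_pm1)
```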

Detecting and Filtering Outliers

What is an outlier?
- An observation that lies far from the rest of the observations in a dataset, i.e., a value that is vastly larger or smaller than the remaining values in the set.
Why do they occur?
- Due to variability in the data, or due to experimental or human error.
- Error vs. novelty: an outlier may be a mistake, or a genuine but unusual observation.
How to detect outliers?
- Visual inspection
  - Look at descriptive statistics
  - Use graphical diagnostic tools, e.g., the boxplot
- Z-score
  - Compute the Z-score using the formula (Xi - mean) / std
  - Define a threshold value of 3 and mark the data points whose absolute Z-score is greater than the threshold as outliers
- Interquartile Range (IQR)
  - Sort the dataset in ascending order
  - Calculate the 1st and 3rd quartiles (Q1, Q3)
  - Compute IQR = Q3 - Q1
  - Compute lower bound = Q1 - 1.5 * IQR and upper bound = Q3 + 1.5 * IQR
  - Mark the values that fall below the lower bound or above the upper bound as outliers
- ML algorithms
Handling outliers
- Trimming/removing the outlier: remove the outliers from the dataset
- Quantile-based flooring and capping: outliers above the 90th percentile value are capped, and outliers below the 10th percentile value are floored, at chosen limit values
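
The Z-score and IQR rules, together with trimming and capping, can be sketched in NumPy as follows. The sample is made up, and capping exactly at the 10th and 90th percentile values is one assumed choice of limits:

```python
import numpy as np

# Hypothetical sample: 19 typical values plus one extreme value.
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                 12, 10, 11, 13, 12, 11, 12, 10, 13, 100], dtype=float)

# Z-score rule: flag points with |z| greater than the threshold 3.
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_mask = (data < lower) | (data > upper)
iqr_outliers = data[iqr_mask]

# Trimming: remove the flagged points from the dataset.
trimmed = data[~iqr_mask]

# Flooring and capping: clip values to the 10th and 90th percentiles.
p10, p90 = np.percentile(data, [10, 90])
capped = np.clip(data, p10, p90)

print(z_outliers, iqr_outliers)
print(len(trimmed), capped.min(), capped.max())
```

Note that the extreme value itself inflates the mean and standard deviation, so with very small samples the Z-score rule can fail to flag a point that the IQR rule catches.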
