1) Handling Missing Data
Missing Data
- Measured data can have missing values for many reasons ➢ e.g., no response, measurement was impossible, instrument malfunction, limits of the technology, etc.
- Running an analysis without handling missing values causes errors when computing the mean, variance, etc., or later when building models ➢ Some packages ship a default handling strategy (e.g., pandas excludes missing values from its calculations)
- Missing values can be handled in several ways
- Removal or deletion of missing values
- Impute missing values with the mean/median/mode
- Inference (Regression / Classification)
- sklearn library / deep learning library
- Domain specific
Datapoint or Feature deletion
- Delete the attributes (columns) that contain missing values
- Or delete the samples (rows) that contain missing values
- The simplest and easiest method, but when many values are missing, most of the data may disappear
- Not a realistic option when data is expensive (e.g., biological experiment data, hospital patient data)
- More data is better for later statistical analysis, but this approach reduces the number of data points
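As a quick sketch of how much data deletion can destroy, consider a small made-up pandas DataFrame in which every column happens to contain one missing entry (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# made-up table: every column has one missing entry (NaN)
df = pd.DataFrame({
    "age":    [25.0, np.nan, 31.0, 44.0],
    "income": [3200, 4100, np.nan, 5000],
    "city":   ["Seoul", "Busan", "Seoul", np.nan],
})

rows_kept = df.dropna()         # delete samples (rows) with any missing value
cols_kept = df.dropna(axis=1)   # delete attributes (columns) with any missing value
```

Here row-wise deletion keeps only 1 of the 4 samples, and column-wise deletion removes every attribute, which illustrates why deletion is risky when missing values are spread across the data.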
Imputing missing values
- Imputation is a method of filling in missing values according to a specific strategy
- mean / median / mode
- distinct value such as 0 or -1
- random selection from the existing values -> the three options above are simple, but they ignore the chance of a better estimate based on correlations with other attributes
- values predicted by a predictive model (worth revisiting after studying machine learning)
- KNN / Naïve Bayes
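Model-based imputation such as the KNN approach mentioned above can be sketched with scikit-learn's `KNNImputer`, which fills each missing entry with the average of that feature over the nearest neighbors (the matrix below is made up for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

# made-up 4x2 matrix with two missing entries
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# each missing entry becomes the mean of that feature over the
# 2 nearest rows (distances are computed on the observed features)
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```

Unlike a plain column mean, the imputed value depends on which rows are most similar to the incomplete one, so correlations between attributes are exploited.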
Replace missing values
- for numerical values, replace the missing value with the average value of the column
- for categorical values, replace the missing value with the most frequent value of the column
- use other functions
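The two rules above can be sketched in pandas with `fillna` (the column names and values here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [80.0, np.nan, 90.0, 70.0],   # numerical column
    "grade": ["A", "B", np.nan, "B"],      # categorical column
})

# numerical -> column mean; categorical -> most frequent value (mode)
df["score"] = df["score"].fillna(df["score"].mean())
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])
```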
2) Data Transformation
- Data transformation is a data preprocessing technique used to reorganize or restructure raw data so that data mining can retrieve strategic information efficiently and easily.
- Removing Duplicates
- Transforming Data Using a Function or Mapping
- Replacing Values
- Renaming Axis Indexes
- Discretization and Binning
- Detecting and Filtering Outliers
- Permutation and Random Sampling
- Computing Indicator/Dummy Variables
Discretization and Binning
- Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets
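The age-bucket example can be sketched with `pd.cut`; the bin edges and labels below are arbitrary choices for illustration:

```python
import pandas as pd

ages = pd.Series([22, 35, 58, 13, 41, 67])

# hypothetical bin edges: (0,18], (18,35], (35,60], (60,100]
bins = [0, 18, 35, 60, 100]
labels = ["child", "young adult", "middle-aged", "senior"]
buckets = pd.cut(ages, bins=bins, labels=labels)
```

Note that `pd.cut` uses right-closed intervals by default, so an age of exactly 35 falls into the (18, 35] "young adult" bin.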
Data Aggregation
- Data collection or aggregation is a method of storing and presenting data in a summary format.
- The data may be obtained from multiple data sources, and aggregation integrates these sources into one data analysis view.
- This is a crucial step since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used.
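A minimal aggregation sketch with pandas `groupby`, using a hypothetical per-transaction sales table:

```python
import pandas as pd

# hypothetical transaction records collected from two regional sources
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "amount": [100, 250, 300, 150],
})

# summarize: total and average amount per region
summary = sales.groupby("region")["amount"].agg(["sum", "mean"])
```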
Data Normalization
- Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0, 1.0]
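Min-max scaling to [0.0, 1.0] can be written directly from its formula, (x - min)/(max - min); the values below are made up:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])

# min-max scaling: maps the smallest value to 0 and the largest to 1
x_scaled = (x - x.min()) / (x.max() - x.min())
```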
Detecting and Filtering Outliers
- What is an outlier?
- An observation in a given dataset that lies far from the rest of the observations, which means an outlier is vastly larger or smaller than the remaining values in the set.
- Why do they occur?
- due to the variability in the data, or due to experimental error/human error.
- error vs novelty
- How to detect outlier?
- Visual inspection
- Z-score
- Interquartile Range (IQR)
- ML algorithms
- Visual inspection
- Look at descriptive statistics
- Use graphical diagnostic tools, e.g. the boxplot graph
- Z-scores
- Compute the Z-score using the formula (Xi - mean)/std
- Define a threshold value of 3 and mark the data points whose absolute Z-score is greater than the threshold as outliers
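The Z-score rule above, sketched in numpy on a made-up array with one injected extreme value (note `np.std` computes the population standard deviation):

```python
import numpy as np

# 200 unremarkable values near 12, plus one injected outlier (100)
data = np.array([10, 12, 11, 13, 12] * 40 + [100], dtype=float)

z = (data - data.mean()) / data.std()   # Z-score: (Xi - mean) / std
outliers = data[np.abs(z) > 3]          # |z| > 3 -> flag as outlier
```

Only the injected value is flagged; with very small samples this threshold can never fire, since the maximum possible |z| grows with the sample size.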
- Interquartile Range (IQR)
- Sort the dataset in ascending order
- Calculate the 1st and 3rd quartiles (Q1, Q3)
- Compute IQR = Q3 - Q1
- Compute lower bound = Q1 - 1.5·IQR and upper bound = Q3 + 1.5·IQR
- Mark values that fall below the lower bound or above the upper bound as outliers
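The IQR steps above as a short numpy sketch on a made-up sorted array:

```python
import numpy as np

data = np.array([4, 5, 6, 7, 8, 9, 10, 50], dtype=float)

q1, q3 = np.percentile(data, [25, 75])           # 1st and 3rd quartiles
iqr = q3 - q1                                    # IQR = Q3 - Q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # fences
outliers = data[(data < lower) | (data > upper)] # values outside the fences
```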
- Handling Outliers
- Trimming/removing the outlier: remove the outliers from the dataset
- Quantile-based flooring and capping: values above a high percentile (e.g., the 90th) are capped at that percentile value, and values below a low percentile (e.g., the 10th) are floored at that value
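Quantile-based flooring and capping can be sketched with pandas `quantile` and `clip`; the series and the 10th/90th percentile choice are illustrative:

```python
import pandas as pd

s = pd.Series([1, 15, 16, 17, 18, 19, 20, 21, 22, 99], dtype=float)

low, high = s.quantile(0.10), s.quantile(0.90)
capped = s.clip(lower=low, upper=high)   # floor at P10, cap at P90
```

Unlike trimming, the extreme values are kept but pulled in to the percentile bounds, so the sample size is preserved.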