03_통계량

Courses/통계적데이터분석

03_통계량

noweahct 2025. 1. 22. 15:19

03_1 시각화 실습

Matplotlib

a comprehensive library for creating static, animated, and interactive visualizations in Python

Seaborn

a Python data visualization library based on matplotlib
provides a high-level interface for drawing attractive and informative statistical graphics
pandas에서 쓰기 더 편한 module임

03_2 통계량

Parameter & Statistics

Parameter(모수): A number describing a whole population
- population is a set of all elements or items that you’re interested in. (e.g., population mean, population deviation)
Statistic (통계량): A number describing a sample
- Ideally, the sample should preserve the essential statistical features of the population to a satisfactory extent. (e.g., sample mean, sample deviation)
Sample Statistic vs. Population Parameter

Statistics(Measures of Central Tendency)

In statistics, a central tendency(or measure of central tendency) is a central or typical value for a probability distribution

Measures of Central Tencendy

1. Mean

2. Trimmed Mean(절사 평균): 극단적으로 크거나 작은 값의 영향을 제거하기 위해 사용

3. Weighted Mean(가중 평균): 각 변수의 중요도가 다른 경우)

4. Median

Median and Robust Estimates: Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal

5. Skewness(왜도): 데이터 분포의 비대칭성

If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1(positively skewed), the data are moderately skewed.
If the skewness is less than -1(negatively skewed) or greater than 1(positively skewed), the data are highly skewed.

Estimates of Variability

1. Variance

2. Standard deviation

3. Mean absolute deviation(평균 절대 편차)

4. Median absolute deviation(중앙값 절대 편차) → a robust estimate of variability

5. Kurtosis(첨도): 정규분포와 비교했을 때 data distribution의 뾰족한 정도

Estimates Based on Percentiles

statistics is based on sorted data and the most basic measure is the range(= the difference between the largest and smallest numbers). So following problems can exist:

1. Percentile: In a data set, the Pth percentile is a value such that at least P percent of the values take on this value or less and at least (100 – P) percent of the values take on this value or more.

2. Quartile(사분위수): Data를 sorting 하고 25%, 50%, 75%, 100% 의 값

– Q1: lower quartile / 25th percentile

– Q2: median / 50th percentile

– Q3: upper quartile / 75th percentile

3. IQR(Interquartile range, 사분위수 범위): Q3 - Q1

data의 분산 정도를 측정하는 measure 중 하나
IQR이 크면 데이터가 흩어져 있을 가능성이 높음

Boxplot: uses boxes and lines to depict the distributions of one or more groups of numeric data
- data의 경향성을 보여줌
- group, category 간 비교가 필요할 때
- skew, variance, outlier 등을 알 수 있음
- data distribution의 자세한 형태는 알 수 없음

Simpson’s Paradox: a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined. This result is often encountered in social-science and medical-science statistics