02_Visualization


noweahct 2025. 1. 22. 15:16

2023-2

02_1 Visualization

Descriptive statistics: summary statistics that quantitatively describe or summarize features of a collection of information
Inferential statistics: infers properties of a population, e.g., by testing hypotheses and deriving estimates

Data Visualization

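Representing numeric data in graphical form
Many visualization packages have been developed (e.g., matplotlib, seaborn)
It is important to draw a plot that fits both the characteristics of the data and the purpose of the presentation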

Types

1. Scatter Plot

uses dots to represent values for two different numeric variables
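Represents two-dimensional data (two quantitative variables) as points on a coordinate system with an x-axis and a y-axis; used to show the relationship between the two variables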
If there is too much data → consider sampling, setting transparency, using a 2-d histogram, etc.
Correlation vs. Causation
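
As a minimal sketch of two of the remedies for a crowded scatter plot (sampling and transparency), the example below uses matplotlib. The data, the sample size of 2,000, and `alpha=0.05` are all assumptions chosen for illustration, not values from the course.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50,000 points with a linear relationship plus noise.
x = rng.normal(size=50_000)
y = 0.5 * x + rng.normal(scale=0.8, size=50_000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Remedy 1: sampling. Draw only a random subset of the points.
sample_idx = rng.choice(len(x), size=2_000, replace=False)
axes[0].scatter(x[sample_idx], y[sample_idx], s=5)
axes[0].set_title("Random sample of 2,000 points")

# Remedy 2: transparency. Overlapping points make dense regions darker.
axes[1].scatter(x, y, s=5, alpha=0.05)
axes[1].set_title("All points, alpha=0.05")

# Pearson correlation between x and y. It measures linear association only
# and says nothing about which variable, if either, causes the other.
r = np.corrcoef(x, y)[0, 1]

# To view the figure, call plt.show() or fig.savefig(...) with a path of your choice.
```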

2. Bar Plot (bar chart, bar graph, column chart)

A bar chart plots numeric values for levels of a categorical feature as bars
Used to show counts of discrete or qualitative data
Things to consider

3. Histogram

a chart that plots the distribution of a numeric variable’s values as a series of bars
Visualizes the distribution of the data; good for getting a rough sense of the distribution
Things to consider

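+) Histograms are used for continuous data, whereas bar charts are used for categorical or nominal data. A histogram shows how often values fall into each interval (bin), while a bar chart shows one value for each category along the x-axis.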
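
The difference between the two can be sketched by looking at the numbers each plot is built from. The fruit labels, the heights, and the bin edges below are made-up values for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data: a bar chart draws one bar per category.
fruits = pd.Series(["apple", "banana", "apple", "cherry", "banana", "apple"])
category_counts = fruits.value_counts()
# category_counts.plot.bar() would draw these counts as a bar chart.

# Hypothetical continuous data: a histogram first groups values into bins,
# then draws one bar per bin showing how many values fall inside it.
heights = np.array([150.2, 161.5, 158.0, 172.3, 169.9, 165.4, 180.1, 175.6])
bin_counts, bin_edges = np.histogram(heights, bins=[150, 160, 170, 180, 190])
# matplotlib's plt.hist(heights, bins=bin_edges) would draw these counts.

print(category_counts)
print(bin_counts)
```

Changing the bin edges changes the shape of a histogram, whereas the bars of a bar chart are fixed by the categories themselves.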

4. Pie Chart

shows how a total amount is divided between levels of a categorical variable as a circle divided into radial slices
Things to consider

5. Line Chart (line plot, line graph)

uses points connected by line segments from left to right to demonstrate changes in value
Well suited to showing patterns such as how y changes as x changes, and the slope of that change
use to emphasize changes in values for one variable (plotted on the vertical axis) for continuous values of second variable (plotted on the horizontal)
Things to consider
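
A minimal line chart sketch with matplotlib, assuming a made-up series measured once per day. The slope of each segment is computed alongside the plot to make the "change in y per change in x" reading explicit.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: one measurement per day for ten days.
days = np.arange(1, 11)
values = np.array([3, 4, 6, 5, 7, 9, 8, 10, 12, 11])

fig, ax = plt.subplots()
ax.plot(days, values, marker="o")  # points joined by line segments, left to right
ax.set_xlabel("day")
ax.set_ylabel("value")

# Slope of each segment: change in y divided by change in x.
slopes = np.diff(values) / np.diff(days)
print(slopes)
```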

6. Heatmap

depicts values for a main variable of interest across two axis variables as a grid of colored squares
Not always restricted to a grid; it can also be drawn on a map
used to show relationships between two variables, one plotted on each axis. By observing how cell colors change across each axis, you can observe if there are any patterns in value for one or both variables
Each axis can be either a categorical label or a numeric value (→ binning required)
The value of each cell is something like a frequency count, mean, or median
Things to consider
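
The grid behind a heatmap can be sketched with pandas: bin the numeric axis, then aggregate the variable of interest per cell. The column names (`weekday`, `temperature`, `sales`), the data, and the bin edges are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: one categorical axis, one numeric axis, one value column.
df = pd.DataFrame({
    "weekday": rng.choice(["Mon", "Tue", "Wed"], size=300),
    "temperature": rng.uniform(0, 30, size=300),
    "sales": rng.integers(10, 100, size=300),
})

# A numeric axis has to be binned before it can index grid cells.
df["temp_bin"] = pd.cut(df["temperature"], bins=[0, 10, 20, 30], include_lowest=True)

# Each cell holds an aggregate of the main variable, here the mean.
grid = df.pivot_table(index="weekday", columns="temp_bin",
                      values="sales", aggfunc="mean", observed=False)
print(grid)

# seaborn.heatmap(grid) or matplotlib's plt.imshow(grid.to_numpy())
# would then color each cell by its value.
```

Swapping `aggfunc="mean"` for `"median"` or `"count"` gives the other cell values mentioned above.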

02_2 Anaconda Numpy Pandas

Why use a virtual environment

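A single Python environment can hold only one version of each package, so when projects need different package versions, a virtual environment lets each project work in its own isolated environment
(When working on a server without root privileges, installing packages system-wide also runs into permission issues)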
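
As a rough illustration of per-project isolation, the sketch below creates an environment with Python's standard-library `venv` module. This is a stand-in chosen to keep the example in Python; with Anaconda the corresponding step is a command such as `conda create -n myenv python=3.10`, where `myenv` and the version are hypothetical. The temporary directory is likewise only a placeholder for a real project folder.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location: a throwaway directory standing in for a project folder.
project_dir = Path(tempfile.mkdtemp())
env_dir = project_dir / ".venv"

# Create an isolated environment inside the project.
# with_pip=False keeps the example fast and avoids any network access.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg marks the directory as a virtual environment; packages installed
# into it live under env_dir and do not touch the system-wide installation.
config_file = env_dir / "pyvenv.cfg"
print(config_file.read_text())
```

Because everything is written under the project directory, no root privileges are needed.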

Pandas

A library for processing and analyzing tabular data
dataframe: a collection of series (columns)


import pandas as pd

pd.DataFrame(data)
pd.DataFrame(data, index=["A", "B", "C", "D"])  # set the row labels
pd.DataFrame(data_2d, columns=[f'col{i}' for i in range(10)],
             index=[f'row{i}' for i in range(5)])
pd.read_csv('ds_salaries.csv', index_col=0)  # use the file's first column as the row index
# 03_2 Visualization practice 2
pd.read_csv('covid19_utf8.csv', parse_dates=["자치구 기준일"])  # parse the '자치구 기준일' column as dates
# difference between the last row and the first row for each column, returned as a Series
pd.Series(data.iloc[-1].values - data.iloc[0].values, index=data.columns)
pd.DatetimeIndex(data['자치구 기준일']).month  # extract only the month from the date-typed '자치구 기준일' column
# 04_1 Data preprocessing
# read the file at the path stored under the 'path' key of the resource descriptor dictionary
pd.read_csv(resource.descriptor['path'])
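
The snippets above refer to files and variables that are not defined here, so the following is a self-contained sketch of the same date-parsing and row-difference calls. The inline CSV text and the column names `date`, `district_a`, and `district_b` are hypothetical stand-ins for the course data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk.
csv_text = """date,district_a,district_b
2021-01-01,10,5
2021-02-01,25,9
2021-03-01,40,20
"""

# parse_dates converts the named column to datetime64 while reading.
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Extract only the month from the date column.
months = pd.DatetimeIndex(data["date"]).month

# Last row minus first row for each numeric column, returned as a Series.
counts = data[["district_a", "district_b"]]
change = pd.Series(counts.iloc[-1].values - counts.iloc[0].values,
                   index=counts.columns)
print(list(months))
print(change)
```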

** iloc: accesses data by position (order); loc: accesses data by index label
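
A short sketch of the difference, using a made-up DataFrame whose row labels are letters so that position and label cannot be confused:

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame({"score": [90, 75, 60, 85]}, index=["A", "B", "C", "D"])

# iloc: integer position, counted from 0.
first_by_position = df.iloc[0, 0]

# loc: row label and column name.
by_label = df.loc["C", "score"]

# Slices differ too: iloc excludes the end position, loc includes the end label.
iloc_slice = df.iloc[0:2]      # rows at positions 0 and 1
loc_slice = df.loc["A":"C"]    # rows labeled A, B, and C

print(first_by_position, by_label, len(iloc_slice), len(loc_slice))
```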
