This post was written with reference to the lectures by 최성철, offered through Naver Boostcourse.

numpy

Numerical Python
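A high-performance scientific computing package for Python
The de facto standard for array operations on matrices and vectors
Faster and more memory-efficient than a plain list
Processes whole data arrays without explicit loops
Provides a wide range of linear algebra functions
Can be integrated with languages such as C, C++, and Fortran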

ndarray

An ndarray can hold only one data type

Pros and cons of dynamic typing vs static typing
- By giving up dynamic typing, NumPy gains speed beyond what a list comprehension can reach.
- What are the pros and cons and needs for static/dynamic type checker or both?

Inefficient memory access in the Python object model (static vs dynamic typing)

Source: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/

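How do a list and an ndarray differ?

Every element of an ndarray has the same size, so storage is much more efficient
- list: stores the memory addresses (references) of separate Python objects; CPython caches small integers, so equal values refer to the same object
- a = [1,2,3,4,5]; b = [5,4,3,2,1]; a[0] is b[-1] # True
- ndarray: stores the values themselves sequentially in one block of memory; indexing creates a new scalar object each time, so the objects differ
- a = np.array(a); b = np.array(b); a[0] is b[-1] # False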
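The identity comparison above can be reproduced with a short sketch. It assumes CPython, where small integers are cached; the variable names are only illustrative.

```python
import numpy as np

# A list stores references to Python objects. CPython caches small integers,
# so equal small values in two lists point to the same object.
list_a = [1, 2, 3, 4, 5]
list_b = [5, 4, 3, 2, 1]
same_object_in_lists = list_a[0] is list_b[-1]

# An ndarray stores raw values in one contiguous buffer. Indexing builds a
# new NumPy scalar object each time, so the identity check fails even
# though the values are equal.
arr_a = np.array(list_a)
arr_b = np.array(list_b)
same_object_in_arrays = arr_a[0] is arr_b[-1]
equal_values_in_arrays = arr_a[0] == arr_b[-1]

print(same_object_in_lists, same_object_in_arrays, equal_values_in_arrays)
```

Comparing with `==` is the safe way to check values; `is` only says whether two names refer to the same object.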

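shape: returns the size of each dimension as a tuple. This is an easy concept to mix up, so make sure to learn it well!

The order is (dim, row, col)
dtype: returns the data type of the ndarray
Names by array rank: rank 0 is a scalar, rank 1 a vector, rank 2 a matrix, and rank 3 or higher an n-tensor
nbytes: returns the memory size of the ndarray object in bytes
np.array([[1,2,3], [4.5, "5", "6"]], dtype=np.float32).nbytes
Output: 24, because float32 takes 32 bits = 4 bytes per element → 6 * 4 bytes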
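As a small illustration of how shape, dtype, and nbytes relate, here is a sketch on a made-up three-dimensional array (the array itself is an assumption, not from the lecture).

```python
import numpy as np

# 4 blocks, each with 2 rows and 3 columns, stored as 4-byte floats.
tensor = np.zeros((4, 2, 3), dtype=np.float32)

print(tensor.shape)   # (4, 2, 3) -> (dim, row, col)
print(tensor.ndim)    # 3
print(tensor.dtype)   # float32
print(tensor.nbytes)  # 96 = 24 elements * 4 bytes
```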

handling shape

reshape: changes the shape of an array while keeping the number of elements the same
test_matrix = [[1,2,3,4], [1,2,5,8]]
print(np.array(test_matrix).shape) # (2, 4)
print(np.array(test_matrix).reshape(8,).shape) # (8,)
print(np.array(test_matrix).reshape(-1, 2).shape) # (4, 2)
# -1 lets NumPy infer that dimension from the total size

flatten: multi-dimensional → 1-dimensional
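A minimal sketch of flatten on a three-dimensional array; the input values are illustrative.

```python
import numpy as np

test_matrix = np.array([[[1, 2, 3, 4], [1, 2, 5, 8]],
                        [[1, 2, 3, 4], [1, 2, 5, 8]]])
flat = test_matrix.flatten()
print(test_matrix.shape)  # (2, 2, 4)
print(flat.shape)         # (16,)

# flatten returns a copy, so changing it leaves the original untouched.
flat[0] = 99
print(test_matrix[0, 0, 0])  # 1
```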

Indexing & Slicing

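Unlike a list, a two-dimensional array supports the [0,0] notation
print(a[0,0]) # same as a[0][0]
Unlike a list, rows and columns can be sliced separately
A step value can also be used to skip elements!
a = np.array([[1,2,3,4,5],[6,7,8,9,10]], int)
a[:,2:] # all rows, columns 2 and up
a[1,1:3] # row 1, columns 1 to 2
a[1:3] # rows 1 to 2, all columns (this array only has rows 0 and 1, so only row 1 is returned)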
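The step-based slicing mentioned above could look like the following sketch (the 3x5 grid is a made-up example).

```python
import numpy as np

grid = np.arange(15).reshape(3, 5)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]

every_other_col = grid[:, ::2]  # all rows, columns 0, 2, 4
sparse_corner = grid[::2, ::3]  # rows 0 and 2, columns 0 and 3

print(every_other_col)
print(sparse_corner)
```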

creation function

np.arange: creates an array of values over a specified range
np.arange(start, stop, step)

np.zeros: an ndarray filled with 0
np.zeros(shape=(10,), dtype=np.int8)
np.ones: an ndarray filled with 1
np.empty: an ndarray with only the shape given and no values set; the memory is not initialized, so it contains arbitrary values
np.zeros_like, np.ones_like, np.empty_like: return an array with the same shape as the given ndarray
test_matrix = np.arange(30).reshape(5,6)
np.ones_like(test_matrix)

identity: creates an identity matrix
np.identity(n=3, dtype=np.int8)

eye: a matrix with 1 on a diagonal (k shifts the starting index of the diagonal)
np.eye(3,5,k=2)

diag: extracts the diagonal values of a matrix

  matrix = np.arange(9).reshape(3,3)
  np.diag(matrix) # array([0,4,8])

np.diag(matrix, k=1) # array([1,5]). The diagonal starts at column k=1
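To see what a few of these creation functions actually return, here is a short sketch; the argument values are arbitrary choices for illustration.

```python
import numpy as np

stepped = np.arange(0, 10, 2)    # the stop value 10 is excluded
shifted_eye = np.eye(3, 5, k=2)  # ones on the diagonal that starts at column 2
like_ones = np.ones_like(np.arange(6).reshape(2, 3))  # same shape and dtype as the input

print(stepped)      # [0 2 4 6 8]
print(shifted_eye)
print(like_ones)
```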

random sampling

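  # uniform takes (low, high, size); normal takes (mean, standard deviation, size)
  np.random.uniform(0,1,10).reshape(2,5) # uniform distribution
  np.random.normal(0,1,10).reshape(2,5) # normal distribution

Creates an array by sampling from a given distribution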

operation functions

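sum: sum of the elements
axis: numbered in the same order as shape. [2D] axis=0 runs down the rows (one result per column), axis=1 runs across the columns (one result per row) / [3D] axis=0 is the depth axis, axis=1 the row axis, axis=2 the column axis
mean & std: return the mean or standard deviation of the elements of an ndarray
vstack & hstack
- vstack: stacks arrays vertically, appending rows
- hstack: stacks arrays horizontally, appending columns
concatenate: does the same job as vstack or hstack depending on the axis argument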
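The axis behavior and the stacking functions above can be checked with a small sketch; the arrays are made-up examples.

```python
import numpy as np

m = np.arange(1, 7).reshape(2, 3)  # [[1 2 3], [4 5 6]]
col_sums = m.sum(axis=0)           # [5 7 9], one value per column
row_sums = m.sum(axis=1)           # [6 15], one value per row

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
stacked_rows = np.vstack((a, b))                   # shape (3, 2)
same_as_vstack = np.concatenate((a, b), axis=0)    # same result as vstack
stacked_cols = np.hstack((a, b.T))                 # shape (2, 3)
same_as_hstack = np.concatenate((a, b.T), axis=1)  # same result as hstack
```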

array operations

element-wise operations: operations applied when the arrays have the same shape
matrix_a = np.arange(1,13).reshape(3,4)
matrix_a * matrix_a

dot-product: the matrix (inner) product
a = np.arange(1,7).reshape(2,3)
b = np.arange(7,13).reshape(3,2)
a.dot(b)
''' [[58, 64], [139,154]] '''

newaxis
b.reshape(-1,2)
b = b[np.newaxis, :] # add a new axis
Used to match shapes before a matrix operation

transpose(), .T: return the transposed matrix
a.transpose()
a.T

broadcasting: operations applied when the arrays have different shapes
e.g. (scalar - vector), (vector - matrix)
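A minimal sketch of broadcasting for the scalar, vector, and matrix cases mentioned above; the values are illustrative.

```python
import numpy as np

matrix = np.arange(1, 7).reshape(2, 3)  # [[1 2 3], [4 5 6]]
row_vector = np.array([10, 20, 30])     # shape (3,)
col_vector = np.array([[100], [200]])   # shape (2, 1)

scalar_result = matrix + 1        # the scalar is stretched over every element
row_result = matrix + row_vector  # the vector is repeated for each row
col_result = matrix + col_vector  # the column is repeated across each row
outer_sum = row_vector[np.newaxis, :] + col_vector  # (1, 3) + (2, 1) -> (2, 3)
```

Shapes are compared from the last dimension backwards, and a dimension of size 1 is stretched to match the other array.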

%timeit: checks code performance in a Jupyter environment

def scalar_vector_product(scalar, vector):
	result = []
	for value in vector:
		result.append(scalar * value)
	return result

iteration_max = 100000000
vector = list(range(iteration_max))
scalar = 2

%timeit scalar_vector_product(scalar, vector) # performance with a for loop
%timeit [scalar * value for value in range(iteration_max)] # performance with a list comprehension
%timeit np.arange(iteration_max) * scalar # performance with numpy

# speed: for loop < list comprehension < numpy
# No speed advantage for allocation rather than computation (e.g. concatenate)

comparisons

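all, any
np.all(condition) # True if every element satisfies the condition
np.any(condition) # True if at least one element satisfies the condition

logical_and, logical_not, logical_or
a = np.array([1,3,0], float)
np.logical_and(a>0, a<3) # each condition produces a boolean array, which is then combined element-wise
# array([True, False, False])

where(condition, value if True, value if False)
- If the True and False values are omitted, it returns the indices that satisfy the condition

isnan(number), isfinite(number): return booleans indicating whether each value is NaN or a finite number
argmax, argmin: return the index of the maximum or minimum value in the array
- With axis set, return the index of the maximum or minimum along that axis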
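The comparison helpers above can be tried out as follows; the arrays are made-up inputs.

```python
import numpy as np

a = np.array([1, 3, 0], float)
labels = np.where(a > 0, 3, 2)  # [3 3 2]
positions = np.where(a > 0)     # a tuple of index arrays: (array([0, 1]),)

with_nan = np.array([1.0, np.nan, np.inf])
nan_mask = np.isnan(with_nan)        # [False  True False]
finite_mask = np.isfinite(with_nan)  # [ True False False]

m = np.array([[1, 2, 4, 7], [9, 88, 6, 45], [9, 76, 3, 4]])
overall = np.argmax(m)             # 5, an index into the flattened array
per_column = np.argmax(m, axis=0)  # [1 1 1 1]
per_row = np.argmin(m, axis=1)     # [0 2 2]
```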

fancy index

numpy can use an array as index values to extract elements

a = np.array([2,4,6,8], float)
b = np.array([0,0,1,3,2,1], int) # the index array must be of int type
a[b] # [2,2,4,8,6,4]
a.take(b) # [2,2,4,8,6,4]

a = np.array([[1,4],[9,16]], float)
b = np.array([0,0,1,1,0], int) # the index array must be of int type
c = np.array([0,1,1,1,1], int)
a[b, c] # b is used as the row index and c as the column index
# [1,4,16,16,4]

numpy data i/o

np.loadtxt: reads text data
np.savetxt: saves text data

Reference: https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html

np.save, np.load: save / load a numpy object
np.save("numpy_object", arr=a_int_3)
np.load(file="numpy_object.npy")
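A round trip through both formats might look like the sketch below. The file names and the temporary directory are assumptions made so the example cleans up after itself.

```python
import os
import tempfile

import numpy as np

a_int = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Text format: human-readable, but the dtype must be given again on load.
    txt_path = os.path.join(tmp_dir, "int_data.csv")
    np.savetxt(txt_path, a_int, fmt="%d", delimiter=",")
    from_txt = np.loadtxt(txt_path, delimiter=",", dtype=int)

    # Binary .npy format: shape and dtype are stored with the data.
    npy_path = os.path.join(tmp_dir, "numpy_object.npy")
    np.save(npy_path, arr=a_int)
    from_npy = np.load(npy_path)

print(from_txt)
print(from_npy)
```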

Pandas

Provides indexing, computation functions, preprocessing functions, and more
Used for data processing and statistical analysis

pandas components

DataFrame: the object that holds an entire data table
- dataframe.T: transpose
- dataframe.values: returns the values as a two-dimensional ndarray
- dataframe.to_csv(): converts to csv
- del dataframe[col_name]: deletes that column
DataFrame(raw_data, columns=column_list)
Series: the object that holds a single column of a DataFrame (= column vector)
Series(data = list_data, index = list_name, dtype = np.float32, name="example series")
- series.values: returns the values as an ndarray
- series.index: returns the index
The name argument sets the name of the series

indexing (loc, iloc)

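loc: index location (index label)
- Accesses the rows or columns of a DataFrame by label or boolean array
- Short for location; accesses data through human-readable label values
iloc: index position (index number)
- Accesses the rows or columns of a DataFrame by integer position
- Short for integer location; accesses data through machine-readable positional indices
- Reference: https://gagadi.tistory.com/16 [가디의 tech 스터디:티스토리]
loc vs iloc?
- loc: slices up to and including the row whose index label is 3
- iloc: slices the first three positions (the stop position is excluded)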
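The difference shows up most clearly when the integer labels do not match the positions. The sketch below uses a hypothetical frame built for that purpose.

```python
import pandas as pd

# Hypothetical frame whose integer labels are deliberately out of order.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=[5, 3, 1, 4, 2])

by_label = df.loc[:3]      # up to and including the label 3 -> labels 5, 3
by_position = df.iloc[:3]  # the first three rows -> labels 5, 3, 1

print(by_label)
print(by_position)
```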

selection & drop

Resetting the index
df.reset_index()
df.index = list(range(15))
drop: deletes rows by index label with axis=0, or columns by column name with axis=1
df.drop([1,2,4])
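A short sketch of both directions of drop, on a made-up frame. Note that drop returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "score": [1, 2, 3, 4, 5],
    "city": list("vwxyz"),
})

rows_dropped = df.drop([1, 2, 4])       # axis=0 by default: remove rows by index label
cols_dropped = df.drop("city", axis=1)  # axis=1: remove a column by name
```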

dataframe operations

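series operations: computed by aligning on the index; labels present in only one of the Series produce NaN
dataframe operations: a DataFrame aligns on both the columns and the index
With the add, sub, div, and mul methods, missing values can be filled with the value given as fill_value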
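Index alignment and fill_value can be seen in a small sketch with two partially overlapping Series (made-up data).

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

plain_sum = s1 + s2                    # a and d have no partner -> NaN
filled_sum = s1.add(s2, fill_value=0)  # the missing side is treated as 0

print(plain_sum)
print(filled_sum)
```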

Series + DataFrame

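Broadcasts along the given axis (axis=0: align on the row index, axis=1: align on the columns)

The plain + operator cannot choose the axis: it always aligns the Series index with the DataFrame columns
Use the add method to specify which axis to broadcast along, e.g. df.add(s2, axis=0)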
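The two alignment directions can be compared side by side; the frame and Series below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
per_column = pd.Series([10, 20], index=["a", "b"])  # one value per column
per_row = pd.Series([100, 200, 300])                # one value per row (index 0, 1, 2)

col_aligned = df + per_column          # Series index matched to the columns
row_aligned = df.add(per_row, axis=0)  # Series index matched to the row index

print(col_aligned)
print(row_aligned)
```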

lambda, map, apply

map can also be used on Series data

  s1 = Series(np.arange(10)) # [0,1,2,3,4,5,6,7,8,9]
  s1.map(lambda x : x**2) # [0,1,4,9,16..]

  # A dict also works. Values missing from the dict keys become NaN
  z = {1 : 'A', 2 : 'B', 3:'C'}
  s1.map(z) # [NaN, A, B, C, NaN ...]

  # A Series also works: each value is looked up in the index of s2
  s2 = Series(np.arange(10,20))
  s1.map(s2) # [10,11,12,13,14...]

  df["sex_code"] = df.sex.map({"male":0, "female":1})

s1.map(lambda x : x**2).head(5)

replace

  # dict form
  df.sex.replace(
      {"male":0, "female":1}
  )

  # target list - conversion list
  df.sex.replace(
      ["male", "female"],
      [0,1], inplace=True
  )

Covers only the data-conversion part of what map can do

apply: unlike map, applies the function to an entire column (Series) at a time
```
  f = lambda x : x.max() - x.min()
  df_info.apply(f)
```

- Can return a Series as well as a scalar


    ```python
    def f(x):
        return Series([x.min(), x.max()], index=["min", "max"])
    df_info.apply(f)
    ```

applymap: applies a function element by element rather than Series by Series (renamed DataFrame.map as of pandas 2.1)
= the same as using apply on a single Series

pandas built-in functions

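df.describe(): summary statistics for numeric data
series.unique(): returns the unique values of the series as an array
df.sub, mean, min, max, count, median, mad, var (axis)
- mad: mean absolute deviation (removed in pandas 2.0)
df.isnull: returns a boolean DataFrame marking which values are null
- isnull().sum(): the number of null values in each column
df.sort_values(): sorts the data by column values
series.corr, series.cov, df.corrwith: a Series' correlation coefficient and covariance with another Series, and a DataFrame's column-wise correlation with another Series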
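A few of these built-ins in action, sketched on a made-up frame with one missing value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "y", "x", "z"],
    "points": [3.0, np.nan, 1.0, 2.0],
})

null_mask = df.isnull()               # same shape as df, True where a value is missing
null_counts = df.isnull().sum()       # number of missing values per column
unique_teams = df["team"].unique()    # ['x' 'y' 'z'] in order of first appearance
sorted_df = df.sort_values("points")  # rows with NaN are placed last by default
```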

groupby

Same as GROUP BY in SQL

split: group the rows that share the same key
apply: apply a function such as sum or std to each group
combine: merge the results into a single output
Usage: df.groupby(grouping column)[target column].operation()
e.g. df.groupby("Team")["Points"].sum()

df.groupby("Team")["Points"].sum()
df.groupby(["Team", "Year"])["Points"].sum()
# Grouping by two columns creates two index levels = hierarchical index

unstack: converts grouped data into matrix form
h_index.unstack()
h_index.reset_index()

swaplevel, sort_index: change the order of the index levels (sort_index replaces the old sortlevel)
h_index.swaplevel()
h_index.swaplevel().sort_index(level=0)

Basic operations (sum, std..) can be performed per index level
- level = 0: the leftmost index
- level = 1: the next index
- h_index.groupby(level=0).sum(); h_index.groupby(level=1).sum() (older pandas also accepted h_index.sum(level=0))
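The hierarchical index steps above can be followed end to end in one sketch. The Team, Year, and Points values are made up for illustration, and the per-level sums use groupby(level=...), which works on current pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Kings"],
    "Year": [2014, 2015, 2014, 2015, 2014],
    "Points": [876, 789, 863, 673, 741],
})

h_index = df.groupby(["Team", "Year"])["Points"].sum()  # Series with a two-level index
as_matrix = h_index.unstack()                           # Team rows x Year columns
by_team = h_index.groupby(level=0).sum()                # aggregate over the Team level
by_year = h_index.groupby(level=1).sum()                # aggregate over the Year level
swapped = h_index.swaplevel().sort_index(level=0)       # Year becomes the outer level

print(as_matrix)
print(by_team)
print(by_year)
```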

grouped

The split state produced by groupby can be extracted

grouped = df.groupby("Team")

# Each group comes out as a (key, value) tuple
for name, group in grouped:
    print(name)
    print(group)

get_group: pass a group key to extract only the group with that key
grouped.get_group("Devils")

Three kinds of apply can be used on the extracted group information

aggregation: extracts summary statistics
grouped['Points'].agg(["sum", "mean", "std"])
Several functions can be applied to a specific column
grouped.agg("sum")
transformation: transforms the data; unlike aggregation, it does not summarize per key but transforms each individual row
score = lambda x : (x)
grouped.transform(score)

filtration: a filtering feature that removes specific groups from the result
df.groupby('Team').filter(lambda x : len(x) >= 3) # takes a function that returns a boolean condition
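The three kinds of apply differ mainly in the shape of what they return. The sketch below uses made-up data, and the mean-centering lambda is only an illustrative choice of transformation, not the one from the lecture.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Riders", "Devils", "Devils", "Kings"],
    "Points": [876, 789, 863, 673, 741, 812],
})
grouped = df.groupby("Team")

# aggregation: one row per group
summary = grouped["Points"].agg(["sum", "mean"])
# transformation: one value per original row (here, distance from the group mean)
centered = grouped["Points"].transform(lambda x: x - x.mean())
# filtration: keeps or drops whole groups
big_teams = df.groupby("Team").filter(lambda x: len(x) >= 3)

print(summary)
print(centered)
print(big_teams)
```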

pivot table

The same as the pivot table in Excel
The index axis works the same way as in groupby
Adds extra labels along the columns and aggregates numeric values into the value cells
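A minimal pivot table sketch on made-up data: Team goes on the index, Year on the columns, and Points is summed into the cells.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Riders"],
    "Year": [2014, 2015, 2014, 2015, 2014],
    "Points": [876, 789, 863, 673, 100],
})

pivot = pd.pivot_table(df, index="Team", columns="Year",
                       values="Points", aggfunc="sum")
print(pivot)
```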

crosstab

Used to compute cross frequencies, ratios, sums, and so on of two columns
A special form of the pivot table
Useful for building things like a user-item rating matrix

→ Similar to pivot table
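As a sketch of the user-item rating matrix mentioned above, with hypothetical user, item, and rating columns:

```python
import pandas as pd

ratings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u3"],
    "item": ["i1", "i2", "i1", "i2", "i3"],
    "rating": [5, 3, 4, 2, 1],
})

# Frequency of each user-item pair.
counts = pd.crosstab(ratings["user"], ratings["item"])

# User-item rating matrix; pairs that never occur are filled with 0.
matrix = pd.crosstab(ratings["user"], ratings["item"],
                     values=ratings["rating"], aggfunc="sum").fillna(0)

print(counts)
print(matrix)
```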

merge & concat

merge

The same feature as a join in SQL
Combines two data sets into one based on a key
- When both sides have the same column
- pd.merge(df_a, df_b, on='subject_id')

- When the column names differ

    `pd.merge(df_a, df_b, left_on = 'subject_id', right_on='subject_id')`

join

left join
right join
full (outer) join
inner join
index based join: joins on the index values
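The join types listed above map onto the how argument of pd.merge. The sketch below uses two small made-up frames that share only some keys.

```python
import pandas as pd

df_a = pd.DataFrame({"subject_id": [1, 2, 3], "test_score": [51, 15, 15]})
df_b = pd.DataFrame({"subject_id": [2, 3, 4], "first_name": ["Alex", "Amy", "Allen"]})

inner = pd.merge(df_a, df_b, on="subject_id", how="inner")  # keys in both: 2, 3
left = pd.merge(df_a, df_b, on="subject_id", how="left")    # all keys of df_a: 1, 2, 3
right = pd.merge(df_a, df_b, on="subject_id", how="right")  # all keys of df_b: 2, 3, 4
outer = pd.merge(df_a, df_b, on="subject_id", how="outer")  # union of keys: 1, 2, 3, 4

# Index based join: rows are matched on the row index instead of a column.
by_index = pd.merge(df_a, df_b, left_index=True, right_index=True)
```

Rows without a partner on the other side get NaN in the columns that come from that side.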

concat

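Appends data of the same form

df_new = pd.concat([df_a, df_b]) # default axis=0: row-wise
df_a.append(df_b) # same result in older pandas; DataFrame.append was removed in pandas 2.0
df_new = pd.concat([df_a, df_b], axis=1) # axis=1: column-wise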
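Both directions of concat, sketched on two made-up frames with the same columns:

```python
import pandas as pd

df_a = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df_b = pd.DataFrame({"id": [3, 4], "name": ["c", "d"]})

rows = pd.concat([df_a, df_b])                               # shape (4, 2), index 0, 1, 0, 1
rows_reindexed = pd.concat([df_a, df_b], ignore_index=True)  # index 0, 1, 2, 3
cols = pd.concat([df_a, df_b], axis=1)                       # shape (2, 4)
```

Row-wise concat keeps the original index labels unless ignore_index=True is passed.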

os.listdir

Loading many data files at once
import os
files = [file_name for file_name in os.listdir("./data") if file_name.endswith("xlsx")]
files

persistence

Give data persistence by saving it to a file!

sqlite3

Connect to a database and pull out specific information

import sqlite3

conn = sqlite3.connect("./data/flights.db") # in practice the database usually lives on a remote server
cur = conn.cursor()
cur.execute("select * from airlines limit 5;")
results = cur.fetchall()
results

# create DataFrames using the db connection conn
df_airplines = pd.read_sql_query("select * from airlines;", conn)
df_airports = pd.read_sql_query("select * from airports;", conn)
df_routes = pd.read_sql_query("select * from routes;", conn)
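If flights.db is not at hand, the same flow can be tried against an in-memory database. The table layout and rows below are invented for the sketch and do not reflect the real flights data.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a throwaway database that lives only in this process.
conn = sqlite3.connect(":memory:")
conn.execute("create table airlines (id integer, name text)")
conn.executemany("insert into airlines values (?, ?)",
                 [(1, "Alpha Air"), (2, "Beta Jet")])
conn.commit()

# Same call as above: run a query and get the result as a DataFrame.
df_airlines = pd.read_sql_query("select * from airlines;", conn)
conn.close()

print(df_airlines)
```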

xls engine

Code for exporting a DataFrame to Excel
Uses openpyxl or XlsxWriter

!conda install -y XlsxWriter

writer = pd.ExcelWriter("./data/df_routes.xlsx", engine="xlsxwriter")
df_routes.to_excel(writer, sheet_name="Sheet1")
writer.close() # writes the file to disk

pickle

The most common way to persist Python objects to a file
df_routes.to_pickle("./data/df_routes.pickle")
df_routes_pickle = pd.read_pickle("./data/df_routes.pickle")
df_routes_pickle.head()
