====== Pandas ======

A Python-based data analysis library.

  * [[http://pandas.pydata.org/|Pandas]]

Correlation analysis

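<code python>
import matplotlib.pyplot as plt
import pandas as pd
# scatter_matrix lives in pandas.plotting (the old pandas.tools.plotting path was removed)
from pandas.plotting import scatter_matrix

infile = 'test-in.csv'
outfile = 'test-out.csv'

# Load the data and compute the pairwise (Pearson) correlation matrix
df = pd.read_csv(infile)
coff = df.corr(numeric_only=True)  # ignore non-numeric columns
coff.to_csv(outfile)

# Scatter-plot matrix with kernel density estimates on the diagonal
scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')
plt.savefig('test-fig.png')

print(df)
print(coff)
</code>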
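
As a follow-up to the correlation matrix above, the sketch below ranks column pairs by the strength of their correlation. The data and the column names ''x'', ''y'', ''z'' are made up for illustration; this is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; any numeric DataFrame works the same way.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 5],
    "z": [5, 3, 4, 1, 2],
})

corr = df.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so that
# each pair appears once and self-correlations are dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to a Series indexed by (column, column) and sort by absolute value.
pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(key=abs, ascending=False)
)

print(pairs)
```

The result is a Series indexed by column pairs, with the most strongly correlated pair (positive or negative) first.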


  * Data preprocessing
    * [[https://towardsdatascience.com/the-simple-yet-practical-data-cleaning-codes-ad27c4ce0a38|The Simple Yet Practical Data Cleaning Codes]]
  * Minimal syntax
    * https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
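
A minimal sketch of typical preprocessing steps (dropping missing values, trimming whitespace, converting types, removing duplicates). The sample data and column names are hypothetical and are not taken from the linked articles.

```python
import pandas as pd

# Hypothetical raw data with stray whitespace, a duplicate row,
# a missing name, and a non-numeric age.
raw = pd.DataFrame({
    "name": [" Alice", "Bob ", "Bob ", None],
    "age": ["30", "41", "41", "n/a"],
})

clean = (
    raw.dropna(subset=["name"])          # drop rows without a name
       .assign(
           name=lambda d: d["name"].str.strip(),                    # trim whitespace
           age=lambda d: pd.to_numeric(d["age"], errors="coerce"),  # text -> number, bad values -> NaN
       )
       .drop_duplicates()                # remove repeated rows
       .reset_index(drop=True)
)

print(clean)
```

Chaining the steps keeps the original ''raw'' frame untouched, which makes it easier to re-run or reorder individual steps.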

===== multiprocessing =====

  * https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1
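
The general idea can be sketched with the standard library alone: split the DataFrame into row blocks, process each block in a separate worker process, and concatenate the results. The helper names ''split_frame'', ''parallel_apply'', and ''add_row_total'' are hypothetical, and the linked article may structure this differently.

```python
import multiprocessing as mp

import pandas as pd


def add_row_total(chunk):
    """Per-chunk work. Must be a top-level function so it can be pickled."""
    out = chunk.copy()
    out["total"] = out["a"] + out["b"]
    return out


def split_frame(df, n_chunks):
    """Split df into at most n_chunks contiguous row blocks."""
    size = max(1, -(-len(df) // n_chunks))  # ceiling division
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]


def parallel_apply(df, func, n_workers=2):
    """Apply func to each row block in a worker pool and recombine."""
    chunks = split_frame(df, n_workers)
    with mp.Pool(n_workers) as pool:
        results = pool.map(func, chunks)
    return pd.concat(results)


if __name__ == "__main__":
    # The guard is required on platforms that start workers by re-importing this module.
    df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
    print(parallel_apply(df, add_row_total, n_workers=2))
```

Each chunk is pickled and sent to a worker, so this only pays off when the per-chunk work is expensive compared with the cost of copying the data between processes.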

===== Out-of-Memory, Out-of-Core =====

Tools that provide faster DataFrames through parallel or distributed processing

  * Polars
    * https://www.pola.rs/
    * https://betterprogramming.pub/this-library-is-15-times-faster-than-pandas-7e49c0a17adc
  * Modin
    * https://github.com/modin-project/modin
    * https://towardsdatascience.com/get-faster-pandas-with-modin-even-on-your-laptops-b527a2eeda74
  * Dask
  * Vaex
    * https://towardsdatascience.com/how-to-process-a-dataframe-with-billions-of-rows-in-seconds-c8212580f447
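
For comparison, plain pandas can already handle a file larger than memory by reading it in chunks, which is the basic out-of-core idea that the tools above build on and automate. The sketch below uses the ''chunksize'' parameter of ''pd.read_csv''; the function name ''chunked_column_mean'' and the inline CSV are hypothetical.

```python
import io

import pandas as pd


def chunked_column_mean(source, column, chunksize=1000):
    """Mean of one column, reading only `chunksize` rows at a time."""
    total = 0.0
    count = 0
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()
        count += chunk[column].count()  # count() skips missing values
    return total / count if count else float("nan")


# A small in-memory CSV stands in for a large file on disk.
csv_text = "x,y\n1,10\n2,20\n3,30\n4,40\n"
print(chunked_column_mean(io.StringIO(csv_text), "y", chunksize=2))
```

This only works for computations that can be combined across chunks (sums, counts, minima, and so on); operations that need the whole dataset at once are where the dedicated libraries help.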


===== Speed Improvements =====

  * [[https://github.com/jmcarpenter2/swifter|Swifter]]
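
As far as its README describes, Swifter tries to vectorize the applied function before falling back to slower strategies. The sketch below shows that same idea by hand in plain pandas, without Swifter; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-wise apply: calls a Python function once per row (slow on large frames).
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Vectorized: a single operation over whole columns.
fast = df["price"] * df["qty"]

print(fast)
```

Both produce the same values; on large frames the vectorized form is usually much faster because the loop runs in compiled code rather than in Python.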


{{tag>pandas dataframe}}