Python pandas: speeding up processing of large datasets

A quick question: I have an Excel file with three columns, A/B/C. If a value in column A contains a value from column C (as a substring), that C value should be written into column B, at the matching row of column A.

Column A has 30,000 entries and column C has 120,000. The code below has been running for over ten minutes without finishing. Is there any way to speed it up?
    import pandas as pd

    df = pd.read_excel('your_file.xlsx')
    df['A'] = df['A'].astype(str)
    df['B'] = df['B'].astype(str)
    df['C'] = df['C'].astype(str)

    # For each row's C value, scan all of column A; wherever A contains it
    # as a substring, write the C value into B at that row of A.
    def update_B(row):
        for idx, a_value in df['A'].items():
            if row['C'] in a_value:
                df.at[idx, 'B'] = row['C']
        return row

    df.apply(update_B, axis=1)

    df.to_excel('output_file.xlsx', index=False)

There are several ways to speed this up: use vectorized pandas operations, compile hot loops with Numba, process the data in chunks, parallelise with Modin, validate up front with Pandera, and store the data in a columnar format such as Parquet.

- Vectorized operations: wherever possible, use pandas' built-in vectorized operations instead of looping over every row with apply; this is usually the biggest win (see the sketch right after this list).
- Numba: a JIT compiler for Python functions that can significantly speed up custom numerical code. It cannot compile pandas methods directly, so pass NumPy arrays into the compiled function.
- Chunked processing: if the data is very large, process it in chunks. This cuts memory overhead and gives natural units of work for multiple cores (a sketch follows the code block below).
- Modin: a pandas-compatible library that improves performance through parallel processing; an existing pandas DataFrame can be wrapped in a Modin DataFrame and used with the same API.
- Pandera: a helper library for validating pandas DataFrames; checking the schema up front avoids unnecessary defensive type conversion and processing later.
- Parquet: compared with row-oriented formats such as Excel, columnar formats such as Parquet are much more efficient for reading and querying big data; consider converting the data to Parquet.
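As a concrete sketch of the vectorization point for this exact task: instead of a nested Python loop (30,000 x 120,000 substring checks), you can join all column C values into one regex alternation and let str.extract scan column A in a single vectorized pass. This is only a sketch, not a drop-in answer: it assumes that taking the first C value found in each A value is acceptable (longer values are preferred when several match at the same position), and compiling a pattern with ~120,000 alternatives has a noticeable one-off cost.

    import re
    import pandas as pd

    df = pd.read_excel('your_file.xlsx')
    df['A'] = df['A'].astype(str)

    # One alternation pattern over all C values; longest first, so the
    # longest value wins when several C values match at the same position
    # (re tries alternatives left to right).
    c_values = sorted(df['C'].dropna().astype(str).unique(), key=len, reverse=True)
    pattern = '|'.join(map(re.escape, c_values))

    # Single vectorized pass: pull the first matching C value out of each
    # A value (NaN where nothing matches) and write it straight into B.
    df['B'] = df['A'].str.extract(f'({pattern})', expand=False)

    df.to_excel('output_file.xlsx', index=False)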

The sketch below strings the remaining techniques together (Numba, Modin, Pandera, Parquet). It assumes a hypothetical numeric data.xlsx with exact matches between A and C, not the substring matching in the question:

    import numpy as np
    import pandas as pd
    import numba
    import modin.pandas as mpd
    import pandera as pa

    # Read the Excel file
    df = pd.read_excel('data.xlsx', engine='openpyxl')

    # Numba compiles plain Python/NumPy code, not pandas methods, so the
    # jitted function receives NumPy arrays rather than Series.
    @numba.jit(nopython=True)
    def filter_data(a, c):
        # Flag, for each element of c, whether it occurs anywhere in a.
        mask = np.zeros(len(c), dtype=np.bool_)
        for i in range(len(c)):
            for j in range(len(a)):
                if c[i] == a[j]:
                    mask[i] = True
                    break
        return mask

    # Call the compiled function once on whole arrays, not once per row
    mask = filter_data(df['A'].to_numpy(), df['C'].to_numpy())
    df.loc[mask, 'B'] = df.loc[mask, 'C']

    # Modin keeps the pandas API but runs operations in parallel across
    # cores; an existing pandas DataFrame can be wrapped directly, and any
    # heavy follow-up operations can then be done on mdf instead of df.
    mdf = mpd.DataFrame(df)

    # Validate the DataFrame structure with Pandera (integer columns assumed)
    schema = pa.DataFrameSchema(
        columns={
            'A': pa.Column(pa.Int64),
            'B': pa.Column(pa.Int64),
            'C': pa.Column(pa.Int64),
        }
    )
    validated_df = schema.validate(df)

    # Store the result in the columnar Parquet format for faster re-reads
    validated_df.to_parquet('data.parquet')
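The chunked-processing idea from the list is not shown above. A minimal sketch, assuming the data has first been exported to CSV (read_excel has no chunksize argument, so an Excel file cannot be streamed directly):

    import pandas as pd

    # Process the file in 10,000-row pieces instead of loading it whole;
    # this caps memory use and gives natural units for multiprocessing.
    chunks = []
    for chunk in pd.read_csv('data.csv', chunksize=10_000):
        chunk['A'] = chunk['A'].astype(str)  # per-chunk work goes here
        chunks.append(chunk)

    df = pd.concat(chunks, ignore_index=True)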


To recap, the code above demonstrates these techniques for speeding up pandas on large datasets: Numba to compile the custom filter_data function; Modin to parallelise pandas operations across cores; Pandera to validate the DataFrame's structure up front and avoid unneeded defensive conversions; and the columnar Parquet format to make reading and querying more efficient. Depending on your situation, you can adopt one or several of these techniques to optimise your code.
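On the Parquet point, most of the read-time saving comes from a later run being able to load only the columns it needs, e.g. (file name as in the block above):

    import pandas as pd

    # Parquet is columnar, so this reads just the A and B column chunks;
    # an Excel or CSV reader would have to parse every row in full.
    df = pd.read_parquet('data.parquet', columns=['A', 'B'])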
