A Fast Way to Save Pandas DataFrames to PostgreSQL
I found Pandas’ built-in DataFrame.to_sql() too slow for my use case, so here’s a faster way to load a DataFrame into a PostgreSQL database using the COPY command.
from io import StringIO
from pandas import DataFrame
from config import postgresql_engine


def save_dataframe_to_postgres(dataframe: DataFrame, table_name: str):
    """Saves the DataFrame to Postgres, overwriting the existing table."""
    dataframe = dataframe.reset_index()  # Turn the index into a regular column

    # Create an empty table with the right schema (dropping any existing one).
    # index=False because the index was already reset into a column above.
    dataframe[0:0].to_sql(table_name, con=postgresql_engine, if_exists='replace', index=False)

    # Write the DataFrame to an in-memory CSV buffer
    f = StringIO()
    dataframe.to_csv(f, header=False, index=False)
    f.seek(0)

    # This next step uses psycopg2's COPY support, so it won't work with other drivers
    fmt = """COPY "{}" ({}) FROM STDIN WITH (FORMAT CSV, HEADER FALSE)"""
    fields = ', '.join('"{}"'.format(x) for x in dataframe.columns)
    query = fmt.format(table_name, fields)

    connection = postgresql_engine.raw_connection()
    with connection.cursor() as cursor:
        cursor.copy_expert(query, f)
    connection.commit()
    connection.close()  # Return the connection to the engine's pool
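For completeness, postgresql_engine imported above is assumed to be a plain SQLAlchemy engine defined in a local config module; a minimal sketch could look like this (the connection URL is a placeholder):

# config.py (hypothetical) -- adjust the connection URL to your own database
from sqlalchemy import create_engine

postgresql_engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')

Calling the function is then a one-liner (the example data is made up):

import pandas as pd

df = pd.DataFrame({'city': ['Oslo', 'Bergen'], 'population': [709000, 291000]})
save_dataframe_to_postgres(df, 'cities')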
There are some tradeoffs. The DataFrame has to fit in memory, and the approach relies on psycopg2's copy_expert, so it won't work with other database drivers or other databases. It could probably be sped up further by splitting the data into chunks and copying them in parallel (a rough sketch of that follows), but this was good enough for my needs.
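Here is an untested sketch of that chunked, parallel variant, reusing the empty-table trick from above and assuming the same postgresql_engine; each worker pulls its own connection from the engine's pool, and the function and parameter names are my own:

from concurrent.futures import ThreadPoolExecutor
from io import StringIO

from pandas import DataFrame

from config import postgresql_engine


def _copy_chunk(chunk: DataFrame, query: str):
    """COPY one chunk of rows using its own pooled connection."""
    buffer = StringIO()
    chunk.to_csv(buffer, header=False, index=False)
    buffer.seek(0)
    connection = postgresql_engine.raw_connection()
    try:
        with connection.cursor() as cursor:
            cursor.copy_expert(query, buffer)
        connection.commit()
    finally:
        connection.close()  # Return the connection to the pool


def save_dataframe_in_parallel(dataframe: DataFrame, table_name: str, workers: int = 4):
    """Split the DataFrame into chunks and COPY them concurrently."""
    dataframe = dataframe.reset_index()
    # Create the empty target table exactly as before
    dataframe[0:0].to_sql(table_name, con=postgresql_engine, if_exists='replace', index=False)
    fields = ', '.join('"{}"'.format(x) for x in dataframe.columns)
    query = """COPY "{}" ({}) FROM STDIN WITH (FORMAT CSV, HEADER FALSE)""".format(table_name, fields)
    # Slice the rows into roughly equal chunks, one per worker
    chunk_size = max(1, -(-len(dataframe) // workers))  # ceiling division
    chunks = [dataframe.iloc[i:i + chunk_size] for i in range(0, len(dataframe), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # list() forces evaluation so any exception in a worker is raised here
        list(executor.map(lambda chunk: _copy_chunk(chunk, query), chunks))

Whether this actually helps depends on where the bottleneck is: psycopg2 releases the GIL during network I/O, so the concurrent COPYs can overlap, but the per-chunk to_csv calls still contend for the GIL.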