Python Pandas Dataframe — Tips and Tricks for Speed Optimization

Flow
5 min readMay 7, 2019

Last Updated: October 3, 2020

When I first started out using Pandas Dataframe, I naively coded using standard for loops, etc to iterate through the dataframe to get the value that I wanted.

With many of these specialized structures, pandas dataframe has its own optimized iterations that you can take advantage of in order to enhance your program’s speed. For small programs, this may not be an issue but as you scale the difference in speed can range from seconds to minutes, or from hours to days.

This following article is intended to provide minor tips and tricks to enhance your speed.

Anytime, when working with any data structure, always keep in mind the difference between loading and retrieving data and the balance between the two.

General Rule of thumb: When possible, avoid for-looping dataframes. There is almost always a faster method that dataframes offer.

Get specific value if you know the column name and row index:

df.loc[ROW, COLUMN]: Think of ROW & COLUMN as a list. You can make any list combination.

Get specific value if you know the column index and row index:

When you have column in index form, you have to use “.iloc” rather than “.loc”

Get specific value if you only know the column name:

df[CONDITION]: The CONDITION is any combination you want. Use “&” for AND, and “|” for OR.

Get specific value if you only know the row index:

  • (Will add in the future)

Get specific value if you only know the column index:

  • (Will add in the future): df_new = df[df.iloc[1:2] == 4]

Get specific values based on specific criteria:

df[CONDITION]: The CONDITION is any combination you want. Use “&” for AND, and “|” for OR.

Get first element of same values:

Get rid of copy warning error (reset index):

Use reset_index()

Merge dataframes with same columns, stack vertically by row

pd.concat([LIST OF DATAFRAMES]): It will append the dataframes vertically

Sort by two different criteria by column (if first criteria is the same):

  • (Will add in the future): df = df.sort_values([‘val1’, ‘val2’], ascending=[False, False])

Sort by row:

  • (Will add in the future)

Remove all NaN values from dataframe

  • (Will add in the future)

Replace all NaN values from dataframe

  • (Will add in the future)

Replace specific values with other values

  • (Will add in the future)

Perform operation on subset of dataframe

  • (Will add in the future)

Get the largest/smallest value of each row in dataframe

  • (Will add in the future): df_new = df.max(axis = 1)

Get the nth-largest/nth-smallest value of each row in dataframe

  • (Will add in the future)

Perform operation on specific row on dataframe

  • (Will add in the future)

Perform operation on each row on dataframe

  • (Will add in the future)

Add a new column, if specific condition met = value, else = other value

  • (Will add in the future)

Add a new column, if many different specific condition= different values

  • (Will add in the future)

Return dataframe if specific dataframe values exists in list

  • (Will add in the future)
  • new_df= df[df[‘val1’].isin(list(df[‘val1’]))]

Shuffle dataframe rows

df = df.sample(frac=1).reset_index(drop=True)

Further edits will be made in the future

If you want to see any additional, tips/tricks added or if you have a particularly difficult question you want answered, feel free to comment below and I will add it to this post.

--

--