Python – why should I make a copy of a data frame in pandas


When selecting a sub dataframe from a parent dataframe, I noticed that some programmers make a copy of the data frame using the .copy() method. For example,

X = my_dataframe[features_list].copy()

…instead of just

X = my_dataframe[features_list]

Why are they making a copy of the data frame? What will happen if I don't make a copy?

Best Answer

This expands on Paul's answer. In Pandas, indexing a DataFrame returns a reference to the initial DataFrame. Thus, changing the subset will change the initial DataFrame. Thus, you'd want to use the copy if you want to make sure the initial DataFrame shouldn't change. Consider the following code:

df = DataFrame({'x': [1,2]})
df_sub = df[0:1]
df_sub.x = -1

You'll get:

0 -1
1  2

In contrast, the following leaves df unchanged:

df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1

This answer has been deprecated in newer versions of pandas. See docs

Related Question