Understanding DataFrame Concatenation

Why Should You Avoid DataFrame Concatenation Inside Loops or Recursion?

When you concatenate two DataFrames, you essentially create a new DataFrame that includes the rows or columns of both. This operation requires allocating new memory and copying the data from the original DataFrames to the new one.
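A quick illustration of this copy behavior (the column name and values here are arbitrary): `pd.concat` returns a brand-new DataFrame and leaves its inputs untouched.

```
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})      # n = 2 rows
df2 = pd.DataFrame({"a": [3, 4, 5]})   # m = 3 rows

combined = pd.concat([df1, df2], ignore_index=True)

print(len(combined))    # 5 rows: n + m, all copied into a new object
print(combined is df1)  # False: df1 and df2 are unchanged
```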

If you have two DataFrames, `df1` with `n` rows and `df2` with `m` rows, the concatenation operation has a complexity of O(n + m), because each row from `df1` and `df2` must be copied into the new DataFrame.

Now, consider a loop that runs `k` times, where in each iteration a new row or a small DataFrame with a constant number of rows `c` is concatenated to an accumulating DataFrame.

```
import pandas as pd

result_df = pd.DataFrame()
for i in range(k):
    # small_df has c rows
    small_df = create_small_df()
    result_df = pd.concat([result_df, small_df])
```

In the first iteration, the complexity is O(c) (just the size of `small_df`). In the second iteration, the complexity is O(c + c), because you’re concatenating `result_df` (now with `c` rows) with another `small_df` (with `c` rows). This pattern continues, leading to a cumulative complexity:

O(c) + O(2c) + O(3c) + … + O(kc)

This series is equivalent to O(c(1 + 2 + 3 + … + k)), which simplifies to O(ck(k + 1)/2).
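A quick sanity check of that closed form, using illustrative values for `k` and `c`:

```
k, c = 1_000, 5

# Total rows copied across all iterations: c + 2c + 3c + ... + kc
total_copied = sum(c * i for i in range(1, k + 1))

# Matches the closed form c * k * (k + 1) / 2
print(total_copied == c * k * (k + 1) // 2)  # True
```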

Since we’re interested in the Big O notation, we focus on the highest order term and constants are not considered, so this simplifies further to O(ck²), which is quadratic complexity.

In a recursive function that concatenates a DataFrame on each call, a similar pattern emerges. The complexity increases with the depth of the recursion.

```
def recursive_concat(df, depth):
    if depth == 0:
        return df
    small_df = create_small_df()
    return recursive_concat(pd.concat([df, small_df]), depth - 1)

result_df = recursive_concat(pd.DataFrame(), k)
```

The complexity is similar to the loop case, growing quadratically with the depth of the recursion.

In both looping and recursive scenarios, the use of DataFrame concatenation leads to a quadratic increase in complexity (O(ck²)). This is highly inefficient, especially for large values of `k`, as it involves a substantial amount of data copying and memory allocation. The efficient alternative is to collect the data first (in a list or similar structure) and then concatenate it in a single operation outside the loop or recursive calls. This approach reduces the complexity to linear, O(n), where `n` is the total number of rows to concatenate.
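A minimal sketch of that pattern, assuming a hypothetical `create_small_df` helper like the one used above:

```
import pandas as pd

def create_small_df():
    # Hypothetical stand-in for whatever produces each small piece (c = 3 rows).
    return pd.DataFrame({"a": [1, 2, 3]})

k = 100

# Collect all the pieces first; appending to a Python list is cheap.
pieces = [create_small_df() for _ in range(k)]

# Then copy everything exactly once, in a single O(n) concatenation.
result_df = pd.concat(pieces, ignore_index=True)
print(len(result_df))  # 300 rows
```

Each call to `pd.concat` still copies all its inputs, so the key is calling it once at the end rather than `k` times along the way.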