Hi, I found your repo very helpful. I just have some minor suggestions.
First, the rainfall data layout you start with is unconventional, so I propose a simple pandas-based function that takes a (date, value) long-format file and converts it to your matrix format.
This assumes the rainfall data has a granularity of one hour or finer.
import numpy as np
import pandas as pd
import polars as pl

def long_to_matrix_format(csv_path):
    """
    Convert a long-format CSV with 'date' and 'value' columns
    to a matrix format with 'Year', 'Day', and hourly columns.

    Parameters:
    - csv_path: str, path to the long-format CSV

    Returns:
    - float64 ndarray in matrix format
    """
    # Read the CSV (polars parses it quickly, then hand off to pandas)
    df = pl.read_csv(csv_path, try_parse_dates=True).to_pandas()
    # Make sure the 'date' column is a pandas datetime
    df["date"] = pd.to_datetime(df["date"])
    # Set the date column as the index and aggregate to hourly totals
    df.set_index("date", inplace=True)
    df = df.resample("60min").sum()
    df["date"] = df.index.to_series()
    # Extract year, day of year, and hour
    df["Year"] = df["date"].dt.year
    df["Day"] = df["date"].dt.dayofyear
    # +1 so hours run from 1 to 24 instead of 0 to 23
    df["Hour"] = df["date"].dt.hour + 1
    # Pivot the DataFrame to the desired matrix format
    matrix_df = df.pivot_table(
        index=["Year", "Day"], columns="Hour", values="value", aggfunc="sum"
    ).reset_index()
    # Rearrange and rename columns to the desired order
    matrix_df = matrix_df[["Year", "Day"] + list(range(1, 25))]
    matrix_df.columns = ["Year", "Day"] + [f"Hour {i}" for i in range(1, 25)]
    # Fill NaN values with 0
    matrix_df.fillna(0, inplace=True)
    # Re-index years to start from 1
    matrix_df["Year"] = matrix_df["Year"] - matrix_df["Year"].min() + 1
    # The contiguous float64 copy is important for the numba code
    return matrix_df.to_numpy().astype(float).copy()
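To make the target layout concrete without any file I/O, here is a minimal NumPy sketch (the `values` array is hypothetical sample data, assuming exactly 24 hourly values per day) that reshapes a flat hourly series into the same (Year, Day, Hour 1..24) matrix the function produces:

```python
import numpy as np

# Hypothetical data: 3 days of hourly rainfall (72 values) in long-format order
values = np.arange(72, dtype=float)

# Reshape to one row per day, one column per hour (Hour 1 .. Hour 24)
matrix = values.reshape(-1, 24)

# Prepend Year and Day columns, as long_to_matrix_format does
year = np.ones((matrix.shape[0], 1))                      # single year, re-indexed to 1
day = np.arange(1, matrix.shape[0] + 1).reshape(-1, 1)    # day of year
matrix = np.hstack([year, day, matrix])

print(matrix.shape)  # (3, 26): Year, Day, Hour 1..24
```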
#-----------------------------------------
Then, the for loop you use to build the data1 DataFrame is very slow, because it calls pandas vectorized functions element by element on an (i, j) basis. My suggestion is small: write the loop in plain Python over NumPy arrays and compile it with numba.
import numba
import numpy as np
from numba.pycc import CC

cc = CC('idf_module')  # optional ahead-of-time compilation module

@numba.njit('float64[:,::1](float64[:,::1])')
@cc.export('nb_get_idf_rolling_sum', 'float64[:,::1](float64[:,::1])')
def get_idf_rolling_sum(df_array):
    # One output row per year (column 0 holds the re-indexed year)
    rows = np.unique(df_array[:, 0]).size
    data = np.zeros((rows, 24))
    for i in range(rows):
        # Filter by year, drop the 'Year' and 'Day' columns, transpose to (hour, day)
        df1_array = df_array[df_array[:, 0] == i + 1][:, 2:].T
        # Loop through window sizes 1 to 24 and find the max rolling sum for each
        for j in range(24):
            max_rolling_sum = -np.inf
            # Loop through each column (day)
            for col in range(df1_array.shape[1]):
                # Loop through each possible window in the current column
                for row in range(df1_array.shape[0] - j):
                    rolling_sum = np.sum(df1_array[row : row + j + 1, col])
                    # Update the max rolling sum if this one is greater
                    if rolling_sum > max_rolling_sum:
                        max_rolling_sum = rolling_sum
            # Store the max rolling sum for this year and window size
            data[i, j] = max_rolling_sum
    return data
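As a quick sanity check of the rolling-sum logic without numba installed, a cumulative-sum trick computes the same max rolling sum for one window size; this is only a sketch on a hypothetical one-day series, not a replacement for the compiled function:

```python
import numpy as np

def max_rolling_sum_1d(x, w):
    # Max sum over all contiguous windows of length w, via cumulative sums:
    # the sum of x[row:row+w] equals c[row+w] - c[row]
    c = np.concatenate(([0.0], np.cumsum(x)))
    return np.max(c[w:] - c[:-w])

# Hypothetical hourly series for a single day
x = np.array([0.0, 2.5, 2.5, 1.0, 0.0, 4.0])

print(max_rolling_sum_1d(x, 1))  # 4.0 (largest single hour)
print(max_rolling_sum_1d(x, 3))  # 6.0 (2.5 + 2.5 + 1.0)
```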
The numba function replaces this part of your code:
---> for i in range(99):
df1 = (((df.where(df['Year']==i+1)).dropna()).drop(['Year','Day'],axis=1)).T
for j in range(24):
data[i][j] = max((df1.rolling(j+1).sum()).max())
#-------------------
Here is a timing test on 10 million rows of a long-format CSV, i.e.:
date,value
1981-01-01 01:00:00, 2.5
1981-01-01 02:00:00, 2.5
1981-01-01 03:00:00, 2.5
1981-01-01 04:00:00, 2.5
.
.
.
nth 10 million
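If you want to reproduce a timing test like this, a synthetic long-format CSV can be generated with the standard library alone (the file name, start date, and constant value below are just placeholders):

```python
import csv
from datetime import datetime, timedelta

def write_synthetic_rainfall(path, n_rows, start="1981-01-01 01:00:00", value=2.5):
    # Write n_rows of hourly 'date,value' pairs starting at `start`
    t = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "value"])
        for i in range(n_rows):
            row_time = t + timedelta(hours=i)
            writer.writerow([row_time.strftime("%Y-%m-%d %H:%M:%S"), value])

# One year of hourly data; scale n_rows up for a real benchmark
write_synthetic_rainfall("rainfall_long.csv", 24 * 365)
```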
# convert long format to matrix format
t0 = 0.5562264919281006
# process rolling sum
t1 = 0.024176836013793945
These two functions use polars, numpy, numba, and pandas.
It would be nice if you turned this into a Python package, maybe adding a couple of features; there are not many Python packages for IDF curve construction.
Have a nice day,
Marcelo.