There are a lot of great Python and pandas code snippets here, but I am not sure anyone has posted a NumPy-based solution.
Below is mine. It gives a 100x speedup over the baseline (quadratic Python) solution.
Still, the suggested pure Python solutions and idiomatic NumPy solutions are much faster. I suspect NumPy can do better than this.
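import time
import numpy as np

def gen_stats_numpy_l(dataset_numpy):
    start = time.time()
    # Assumes rows are sorted by product id (column 0), so the index of each
    # id's first occurrence marks the start of its group.
    unique_products, unique_indices = np.unique(dataset_numpy[:, 0], return_index=True)
    product_stats = []
    # The first boundary is 0, so np.split yields an empty leading chunk; drop it.
    split = np.split(dataset_numpy, unique_indices)[1:]
    for item in split:
        length = len(item)
        # [product id, row count, sum of column 2, mean of column 3 rounded to 2 places]
        product_stats.append([
            int(item[0, 0]),
            int(length),
            int(np.sum(item[:, 2])),
            float(np.round(np.sum(item[:, 3]) / length, 2)),
        ])
    end = time.time()
    working_time = end - start
    return product_stats, working_time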
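One possible way to remove the remaining Python loop is to let NumPy do the per-group reductions itself. The sketch below is not benchmarked and is only an illustration: the function name `gen_stats_numpy_vectorized` is hypothetical, and it assumes, like the code above, that rows are sorted by product id in column 0, with column 2 summed and column 3 averaged.

```python
import numpy as np

def gen_stats_numpy_vectorized(dataset_numpy):
    # Hypothetical loop-free variant; assumes rows are sorted by column 0.
    ids, starts, counts = np.unique(
        dataset_numpy[:, 0], return_index=True, return_counts=True
    )
    # reduceat sums each slice [starts[i], starts[i+1]); the last slice runs to the end.
    sums = np.add.reduceat(dataset_numpy[:, 2], starts)
    means = np.round(np.add.reduceat(dataset_numpy[:, 3], starts) / counts, 2)
    return ids.astype(int), counts, sums, means

# Tiny made-up input: two rows for product 1, one row for product 2.
data = np.array([
    [1, 0, 2, 4.0],
    [1, 0, 3, 5.0],
    [2, 0, 10, 3.0],
])
ids, counts, sums, means = gen_stats_numpy_vectorized(data)
```

This returns four parallel arrays rather than a list of lists; if the list form is needed, `np.column_stack` followed by `.tolist()` would convert it, at the cost of turning every field into a float.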