python - Speeding up Loading of a Pandas Sparse DataFrame
I have a large pickled sparse DataFrame that I generated. Since it is too big to hold in memory, I had to incrementally append it as it was generated, as follows:
    import pickle

    with open('data.pickle', 'ab') as output:
        pickle.dump(df.to_sparse(), output, pickle.HIGHEST_PROTOCOL)
Then, in order to read the file, I do the following:
    import pandas as pd
    import pickle

    df_2 = pd.DataFrame([]).to_sparse()
    with open('data.pickle', 'rb') as pickle_file:
        try:
            while True:
                test = pickle.load(pickle_file)
                df_2 = pd.concat([df_2, test], ignore_index=True)
        except EOFError:
            pass
Given the size of the file (20 GB), this method works, but it takes a really long time. Is it possible to parallelize the pickle.load/pd.concat steps for a quicker loading time? Or are there other suggestions for speeding the process up, specifically on the loading part of the code?
Note: the generation step is done on a computer with fewer resources; that's why the load step, done on a more powerful machine, can hold the DataFrame in memory.
Thanks!
Don't concat in a loop! This is noted in the docs, though maybe it should be a warning:
    df_list = []
    with open('data.pickle', 'rb') as pickle_file:
        try:
            while True:
                test = pickle.load(pickle_file)
                df_list.append(test)
        except EOFError:
            pass

    df_2 = pd.concat(df_list, ignore_index=True)
You are making a copy of the frame each time through the loop right now, and it keeps growing, so this is not efficient at all. The idiom is to append to a list, then do a single concat at the end, as in the comparison below.
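For intuition, here is a small self-contained comparison; the chunk sizes and counts are made up for illustration:

    import pandas as pd

    chunks = [pd.DataFrame({'a': range(1000)}) for _ in range(500)]

    # quadratic: every iteration copies the ever-growing frame
    slow = pd.DataFrame()
    for c in chunks:
        slow = pd.concat([slow, c], ignore_index=True)

    # linear: collect first, copy once at the end
    fast = pd.concat(chunks, ignore_index=True)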
Furthermore, you are going to be better off writing an HDF5 file in the data generation step. It is faster, and compressible. You can probably even get away with writing the full (non-sparse) df once you turn on compression, unless it is extremely sparse.
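A minimal sketch of that approach, assuming the generation step yields chunks and using a file name data.h5 (the generate_chunks helper and file name are mine, not from the question):

    import pandas as pd

    def generate_chunks():
        # stand-in for the real data generation step
        for i in range(3):
            yield pd.DataFrame({'a': range(i * 10, (i + 1) * 10)})

    # generation machine: append each chunk to a compressed, table-format store
    with pd.HDFStore('data.h5', mode='a', complib='blosc', complevel=9) as store:
        for df in generate_chunks():
            store.append('df', df, index=False)

    # loading machine: a single call reads the whole table back
    df_2 = pd.read_hdf('data.h5', 'df')

Appending in table format also means the loading machine can read the file in one call (or even query subsets) instead of deserializing pickle objects in a Python loop.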