python - Speeding up Loading of a Pandas Sparse DataFrame


I have a large pickled sparse DataFrame. Since it is too big to hold in memory, I had to incrementally append it as it was generated, as follows:

import pickle
import pandas as pd

with open('data.pickle', 'ab') as output:
    pickle.dump(df.to_sparse(), output, pickle.HIGHEST_PROTOCOL)

Then, in order to read the file, I do the following:

df_2 = pd.DataFrame([]).to_sparse()
with open('data.pickle', 'rb') as pickle_file:
    try:
        while True:
            test = pickle.load(pickle_file)
            df_2 = pd.concat([df_2, test], ignore_index=True)
    except EOFError:
        pass

Given the size of the file (20 GB), this method works, but it takes a really long time. Is it possible to parallelize the pickle.load/pd.concat steps for a quicker loading time? Or are there other suggestions for speeding the process up, specifically in the loading part of the code?

Note: the generation step is done on a computer with fewer resources; that's why the load step, done on a more powerful machine, can hold the DataFrame in memory.

Thanks!

Don't concat in a loop! This is noted in the docs; maybe it should be a warning.

df_list = []
with open('data.pickle', 'rb') as pickle_file:
    try:
        while True:
            test = pickle.load(pickle_file)
            df_list.append(test)
    except EOFError:
        pass

df_2 = pd.concat(df_list, ignore_index=True)

You are making a copy of the frame each time through the loop right now, and it keeps growing, so this is not efficient at all.

The idiom is to append to a list, then do a single concat at the end.

Furthermore, you are going to be better off writing an HDF5 file in the data generation step. It is faster to read and compressible. You can probably get away with writing the full df rather than the sparse one, unless it is extremely sparse, if you turn on compression.
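A minimal sketch of that approach, assuming each generated chunk is a dense DataFrame (the file name data.h5, the key 'df', and the variable df_chunk are placeholders; this requires the PyTables package):

import pandas as pd

# During generation: append each chunk to a compressed HDF5 store.
# complevel/complib turn on blosc compression; index=False skips
# rebuilding a table index on every append, which speeds up writes.
with pd.HDFStore('data.h5', mode='a', complevel=9, complib='blosc') as store:
    store.append('df', df_chunk, index=False)

# On the loading machine, the whole thing comes back in one call:
df_2 = pd.read_hdf('data.h5', 'df')

This replaces the many pickle.load/concat steps with a single read, which is where most of your time is going.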

