python - Speeding up Loading of a Pandas Sparse DataFrame
I have a large pickled sparse DataFrame that I generated. Since it is too big to hold in memory, I had to incrementally append it as it was generated, as follows:
    import pickle

    with open('data.pickle', 'ab') as output:
        pickle.dump(df.to_sparse(), output, pickle.HIGHEST_PROTOCOL)
Then, in order to read the file, I do the following:
    import pandas as pd
    import pickle

    df_2 = pd.DataFrame([]).to_sparse()
    with open('data.pickle', 'rb') as pickle_file:
        try:
            while True:
                test = pickle.load(pickle_file)
                df_2 = pd.concat([df_2, test], ignore_index=True)
        except EOFError:
            pass
Given the size of the file (20 GB), this method works, but it takes a really long time. Is it possible to parallelize the pickle.load/pd.concat steps for a quicker loading time? Or are there other suggestions for speeding the process up, specifically on the loading part of the code?
Note: the generation step is done on a computer with fewer resources; that's why the load step, done on a more powerful machine, can hold the DataFrame in memory.
Thanks!
Don't concat in a loop! This is noted in the docs, though maybe it should be a warning:
    df_list = []
    with open('data.pickle', 'rb') as pickle_file:
        try:
            while True:
                test = pickle.load(pickle_file)
                df_list.append(test)
        except EOFError:
            pass

    df_2 = pd.concat(df_list, ignore_index=True)
You are making a copy of the frame each time through the loop right now, and it keeps growing, so this is not efficient at all. The idiom is to append to a list, then do a single concat at the end, as in the comparison below.
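For intuition, here is a small self-contained comparison; the chunk sizes and counts are made up for illustration:

    import pandas as pd

    chunks = [pd.DataFrame({'a': range(1000)}) for _ in range(500)]

    # quadratic: every iteration copies the ever-growing frame
    slow = pd.DataFrame()
    for c in chunks:
        slow = pd.concat([slow, c], ignore_index=True)

    # linear: collect first, copy once at the end
    fast = pd.concat(chunks, ignore_index=True)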
Furthermore, you are going to be better off writing an HDF5 file in the data generation step. It is faster, and compressible. You can probably even get away with writing the full (non-sparse) df once you turn on compression, unless it is extremely sparse.
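A minimal sketch of that approach, assuming the generation step yields chunks and using a file name data.h5 (the generate_chunks helper and file name are mine, not from the question):

    import pandas as pd

    def generate_chunks():
        # stand-in for the real data generation step
        for i in range(3):
            yield pd.DataFrame({'a': range(i * 10, (i + 1) * 10)})

    # generation machine: append each chunk to a compressed, table-format store
    with pd.HDFStore('data.h5', mode='a', complib='blosc', complevel=9) as store:
        for df in generate_chunks():
            store.append('df', df, index=False)

    # loading machine: a single call reads the whole table back
    df_2 = pd.read_hdf('data.h5', 'df')

Appending in table format also means the loading machine can read the file in one call (or even query subsets) instead of deserializing pickle objects in a Python loop.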