scala - Apache Spark RDD - not updating -


I create a pair RDD whose values are Vectors:

var newRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))

Later on I update the RDD:

newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)

However, although this outputs an updated Vector (as shown in the console), the next time I call newRDD I can see that the Vector value has changed. Through testing I have concluded that it has changed to a value produced by math.random: every time I call newRDD, the Vector changes. I understand there is a lineage graph, and maybe that has something to do with it. I need to update the Vector held in the RDD with new values, and I need to do this repeatedly.

Thanks.

RDDs are immutable structures meant to distribute operations on data over a cluster. There are two elements playing a role in the behavior you are observing here.

First, the RDD lineage may be recomputed every time. In this case, it means that an action on newRDD might trigger the lineage computation, therefore re-applying the Vector(Array.fill(2){math.random}) transformation and producing new values each time. The recomputation can be avoided using cache, in which case the result of the transformation is kept in memory and/or on disk after the first time it is computed. This results in:

val randomVectorRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
randomVectorRDD.cache()

The second aspect that needs further consideration is the in-place mutation:

newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)

Although this might work on a single machine because the Vector references are local, it will not scale to a cluster: the values returned by lookup are serialized copies, so mutations made to them are not preserved in the RDD. This raises the question of why Spark is being used for this.

To be implemented on Spark, this algorithm will need to be redesigned so that it is expressed in terms of transformations instead of point lookups and mutations.
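As a rough illustration of what "expressed in terms of transformations" could look like, here is a minimal sketch, not the original poster's algorithm. It assumes the per-user corrections can be collected into a second pair RDD (called deltas here) and joined with the vectors, producing a new RDD on each round. All names (VectorUpdateSketch, updated, applyUpdates, deltas, step) are hypothetical. Plain Array[Double] stands in for the Vector type used in the question, and scala.util.Random.nextDouble() stands in for math.random.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Random

object VectorUpdateSketch {
  type Vec = Array[Double] // stand-in for the question's Vector type

  // Returns a new vector v + step * delta; v itself is never mutated.
  def updated(v: Vec, delta: Vec, step: Double): Vec =
    v.zip(delta).map { case (a, b) => a + step * b }

  // One update round as a transformation. Keys without a delta keep their vector.
  def applyUpdates(vectors: RDD[(Int, Vec)],
                   deltas: RDD[(Int, Vec)],
                   step: Double): RDD[(Int, Vec)] =
    vectors.leftOuterJoin(deltas).mapValues {
      case (v, Some(d)) => updated(v, d, step)
      case (v, None)    => v
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("vector-update-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input: user id -> list of items.
      val oldRDD = sc.parallelize(Seq(1 -> Seq("a"), 2 -> Seq("b")))

      // Random initial vectors, cached so later actions reuse the same values.
      var vectors: RDD[(Int, Vec)] =
        oldRDD.mapValues(_ => Array.fill(2)(Random.nextDouble())).cache()
      vectors.count() // force materialization

      // Hypothetical corrections for user 1 only.
      val deltas = sc.parallelize(Seq(1 -> Array(1.0, 1.0)))

      // Repeated updates: each round yields a new RDD that replaces the old one.
      for (_ <- 1 to 3) {
        val next = applyUpdates(vectors, deltas, 0.2).cache()
        next.count()        // materialize before dropping the previous round
        vectors.unpersist()
        vectors = next
      }

      vectors.collect().foreach { case (k, v) => println(s"$k -> ${v.mkString(", ")}") }
    } finally {
      sc.stop()
    }
  }
}
```

The var now points at a different immutable RDD after each round, rather than at mutated contents. Cached partitions can still be evicted and recomputed from the lineage, so if the initial random values must stay stable under memory pressure, a disk-backed storage level or checkpointing may be worth considering.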

