scala - Apache Spark RDD - not updating
I create a pair RDD which contains a Vector.
var newrdd = oldrdd.mapValues(listofitemsandratings => Vector(Array.fill(2){math.random}))
Later on I update the RDD:
newrdd.lookup(ratingobject.user)(0) += 0.2 * (errorrate(rating) * myvector)
However, although this outputs an updated Vector (as shown in the console), when I next call newrdd I can see that the Vector value has changed. Through testing I have concluded that it has changed to something given by math.random: every time I call newrdd, the Vector changes. I understand there is a lineage graph and maybe that has something to do with it. I need to update a Vector held in an RDD to new values, and I need to do this repeatedly.
Thanks.
RDDs are immutable structures meant to distribute operations on data across a cluster. There are 2 elements playing a role in the behavior you are observing here:
First, the RDD lineage may be computed every time. In this case, that means an action on newrdd might trigger the lineage computation, therefore applying the Vector(Array.fill(2){math.random}) transformation again and resulting in new values each time. The lineage can be broken using cache, in which case the value of the transformation is kept in memory and/or on disk after the first time it's applied. This results in:
val randomvectorrdd = oldrdd.mapValues(listofitemsandratings => Vector(Array.fill(2){math.random}))
randomvectorrdd.cache()
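The effect of cache can be checked with a small local experiment. This is only a sketch under assumptions: the local master, the app name, the sample users and the use of Array[Double] with scala.util.Random.nextDouble() (as a stand-in for Vector and math.random) are all illustrative and not taken from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local SparkContext for illustration only; master URL and app name are assumptions.
val conf = new SparkConf().setMaster("local[2]").setAppName("cache-sketch")
val sc = new SparkContext(conf)

// Hypothetical stand-in for oldrdd: user id -> list of (item, rating) pairs.
val oldrdd = sc.parallelize(Seq(
  "alice" -> List(("item1", 4.0)),
  "bob"   -> List(("item2", 2.5))
))

// Without cache(): each action re-runs the mapValues closure,
// so the random values are drawn again.
val uncached = oldrdd.mapValues(_ => Array.fill(2)(scala.util.Random.nextDouble()))
val run1 = uncached.collect().map { case (k, v) => k -> v.toList }.toMap
val run2 = uncached.collect().map { case (k, v) => k -> v.toList }.toMap

// With cache(): the first action materializes the partitions,
// and later actions reuse them instead of recomputing.
val cached = oldrdd.mapValues(_ => Array.fill(2)(scala.util.Random.nextDouble())).cache()
val run3 = cached.collect().map { case (k, v) => k -> v.toList }.toMap
val run4 = cached.collect().map { case (k, v) => k -> v.toList }.toMap

println(s"uncached runs equal: ${run1 == run2}")
println(s"cached runs equal:   ${run3 == run4}")

sc.stop()
```

The cached comparison should hold as long as the cached partitions stay in memory. If Spark evicts a cached partition, it is recomputed from the lineage, so the random values for that partition could still change.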
The second aspect, which needs further consideration, is the in-place mutation:
newrdd.lookup(ratingobject.user)(0) += 0.2 * (errorrate(rating) * myvector)
Although this might work on a single machine because the Vector references are local, it will not scale to a cluster, as the lookup references will be serialized and the mutations will not be preserved. It therefore raises the question of why to use Spark for this.
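Why a mutation on a looked-up value is lost can be illustrated without Spark at all. The sketch below is an analogy, not Spark's actual code path: roundTrip is a hypothetical helper that pushes a value through plain Java serialization, roughly what happens to a result on its way from an executor to the driver (Spark may be configured with a different serializer).

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical helper: serialize a value to bytes and read it back,
// which yields an independent copy of the original object.
def roundTrip[T <: AnyRef](value: T): T = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  try in.readObject().asInstanceOf[T] finally in.close()
}

val stored = Array(0.1, 0.2)     // stands in for the vector held in a partition
val fetched = roundTrip(stored)  // stands in for what lookup hands back to the driver
fetched(0) += 0.5                // mutate the copy in place

println(stored.toList)   // the stored vector is unchanged
println(fetched.toList)  // only the local copy was modified
```

The update is applied to the driver-side copy only, so the next action on the RDD still sees the old (or a recomputed) value.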
To be implemented on Spark, this algorithm will need a redesign in order to be expressed in terms of transformations instead of punctual lookups/mutations.
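One possible shape for such a redesign is to compute the corrections as their own RDD and join them with the current vectors, producing a new RDD per iteration. This is a minimal sketch and not the asker's algorithm: the names vectors, deltas and updated, the sample data, and the use of Array[Double] instead of Vector are all assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local SparkContext for illustration only; master URL and app name are assumptions.
val conf = new SparkConf().setMaster("local[2]").setAppName("update-sketch")
val sc = new SparkContext(conf)

// Hypothetical current state: user id -> 2-element vector.
val vectors = sc.parallelize(Seq(
  "alice" -> Array(0.1, 0.2),
  "bob"   -> Array(0.3, 0.4)
)).cache()

// Hypothetical corrections for this iteration, e.g. the result of
// 0.2 * (errorrate(rating) * myvector) computed per user elsewhere.
val deltas = sc.parallelize(Seq("alice" -> Array(0.02, -0.01)))

// Derive the next generation of vectors with a join instead of mutating in place.
// Users without a correction keep their current vector.
val updated = vectors.leftOuterJoin(deltas).mapValues {
  case (v, Some(d)) => v.zip(d).map { case (a, b) => a + b }
  case (v, None)    => v
}.cache()

val result = updated.collect().map { case (k, v) => k -> v.toList }.toMap
println(result)

sc.stop()
```

In an iterative algorithm, the variable holding the vectors would be reassigned to the new RDD on each round, and the previous generation could be released with unpersist() once it is no longer needed.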