matlab - How can I merge together two co-occurrence matrices with overlapping but not identical vocabularies? -


I'm looking at word co-occurrence in a number of documents. For each set of documents, I find a vocabulary of the n most frequent words. I then make an n-by-n matrix for each document, representing whether the words occur together in the same context window (a sequence of k words). This is a sparse matrix, so if I have m documents, I have an n-by-n-by-m sparse matrix. Because MATLAB cannot store sparse matrices with more than 2 dimensions, I flatten this into an (n*n)-by-m sparse matrix.
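The flattening described above can be sketched as follows. This is only an illustration with assumed toy sizes; the names `flat`, `cooc` and `last_doc`, and the random counts produced by `sprand`, are hypothetical stand-ins and not part of the original post.

```matlab
n = 4;   % vocabulary size (toy value, assumed for illustration)
m = 3;   % number of documents (toy value)

flat = sparse(n*n, m);           % one flattened n-by-n matrix per column
for d = 1:m
    cooc = sprand(n, n, 0.3);    % stand-in for a real per-document co-occurrence matrix
    cooc = cooc + cooc';         % co-occurrence is symmetric
    flat(:, d) = cooc(:);        % column-major flattening into column d
end

% recover the n-by-n matrix of the last document from its column
last_doc = reshape(flat(:, m), n, n);
```

Because `(:)` and `reshape` both use column-major order, the round trip returns the per-document matrix unchanged, and the result stays sparse.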

The problem I face is that I have generated 2 of these co-occurrence matrices for different sets of documents. Because the sets are different, the vocabularies are different. Instead of merging the sets of documents and recalculating the co-occurrence matrix, I'd like to merge the 2 existing matrices together.

For example,

    n = 5; % size of vocabulary
    m = 5; % number of documents
    a = ones(n*n, m); % a is a flattened (n, n, m) matrix
    b = 2*ones(n*n, m); % b is a flattened (n, n, m) matrix
    a_ind = {'a', 'b', 'c', 'd', 'e'}; % vocabulary labels of a
    b_ind = {'a', 'f', 'b', 'c', 'g'}; % vocabulary labels of b

These should merge to produce a (49, 5) matrix, where each (49, 1) slice can be reshaped into a (7, 7) matrix with the following structure.

       a     b     c     d     e     f     g
  __________________________________________
a|     3     3     3     1     1     2     2
b|     3     3     3     1     1     2     2
c|     3     3     3     1     1     2     2
d|     1     1     1     1     1     0     0
e|     1     1     1     1     1     0     0
f|     2     2     2     0     0     2     2
g|     2     2     2     0     0     2     2

Where a and b overlap, the co-occurrence counts should be added together. Otherwise, the elements should be the counts from a or the counts from b. There will be some elements (the 0's in the example) for which I don't have count statistics, because some of the vocabulary is exclusively in a and some is exclusively in b.
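The merge rule for this toy example can be reproduced directly from vocabulary membership, which gives a quick way to check any merge routine. A minimal sketch; `vocab`, `in_a`, `in_b` and `expected` are names introduced here for illustration, and the weights 1 and 2 are simply the constant values that a and b hold in the example.

```matlab
a_ind = {'a', 'b', 'c', 'd', 'e'};
b_ind = {'a', 'f', 'b', 'c', 'g'};
vocab = union(a_ind, b_ind);             % sorted merged vocabulary: 'a' through 'g'

in_a = double(ismember(vocab, a_ind));   % 1 where the word is in a's vocabulary
in_b = double(ismember(vocab, b_ind));   % 1 where the word is in b's vocabulary

% outer products mark the word pairs each matrix covers;
% every entry of a is 1 and every entry of b is 2 in the example
expected = 1*(in_a'*in_a) + 2*(in_b'*in_b);
disp(expected)
```

Pairs covered by both matrices sum to 3, pairs covered by only one keep that matrix's value, and pairs covered by neither stay 0.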

The key is to use the ability of logical indices to be flattened.

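    a = ones(25, 5);
    b = 2*ones(25, 5);
    a_ind = {'a', 'b', 'c', 'd', 'e'};
    b_ind = {'a', 'f', 'b', 'c', 'g'};

    % merged vocabulary: all of a_ind, then the words that occur only in b_ind
    new_ind = [a_ind, b_ind(~ismember(b_ind, a_ind))];
    new_size = length(new_ind)^2;

    new_array = zeros(new_size, 5);

    % find the indices that correspond to elements of a
    a_overlap = double(ismember(new_ind, a_ind));
    a_mask = (a_overlap'*a_overlap)==1;

    % find the indices that correspond to elements of b
    b_overlap = double(ismember(new_ind, b_ind));
    b_mask = (b_overlap'*b_overlap)==1;

    % the masks select entries in new_ind order, but b_ind lists its words in a
    % different order, so build a row permutation of b that matches the mask
    [~, b_order] = ismember(new_ind(b_overlap==1), b_ind);
    b_pos = reshape(1:numel(b_ind)^2, numel(b_ind), numel(b_ind));
    b_pos = b_pos(b_order, b_order);

    % flatten the logical indices and assign the elements to the new array
    new_array(a_mask(:), :) = a;
    new_array(b_mask(:), :) = new_array(b_mask(:), :) + b(b_pos(:), :);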
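For sparse inputs, a possible variant is to map each vocabulary into the merged one with the second output of `ismember`, turn the (row, column) word pairs into flattened row indices with `sub2ind`, and let `sparse` sum the entries that land on the same position. This is a sketch under the same toy setup rather than the method of the answer above; `a_loc`, `b_loc`, `a_rows`, `b_rows`, `merged` and `slice` are names introduced here, and the merged vocabulary is taken in sorted order via `union` as an assumption.

```matlab
a_ind = {'a', 'b', 'c', 'd', 'e'};
b_ind = {'a', 'f', 'b', 'c', 'g'};
m = 5;                                   % number of documents
a = sparse(ones(25, m));                 % flattened (5, 5, m) counts for a
b = sparse(2*ones(25, m));               % flattened (5, 5, m) counts for b

new_ind = union(a_ind, b_ind);           % sorted merged vocabulary
k = numel(new_ind);

% position of every word of each vocabulary inside the merged vocabulary
[~, a_loc] = ismember(a_ind, new_ind);
[~, b_loc] = ismember(b_ind, new_ind);

% flattened row index in the merged matrix for every flattened row of a and b
[ri, ci] = ndgrid(a_loc, a_loc);
a_rows = sub2ind([k, k], ri(:), ci(:));
[ri, ci] = ndgrid(b_loc, b_loc);
b_rows = sub2ind([k, k], ri(:), ci(:));

% collect the nonzeros of both inputs; sparse() adds values at duplicate indices
[ia, ja, va] = find(a);
[ib, jb, vb] = find(b);
merged = sparse([a_rows(ia); b_rows(ib)], [ja; jb], [va; vb], k*k, m);

% reshape one document's column back into a (k, k) matrix for inspection
slice = full(reshape(merged(:, 1), k, k));
```

Since every row is placed by its word labels, the order in which either vocabulary lists its words does not matter, and the merged matrix is never stored in full form.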
