Neo4j Cypher query efficiency and syntax
I am attempting to query an ontology of health represented as an acyclic, directed graph in Neo4j v2.1.5. The database consists of 2 million nodes and 5 million edges/relationships. The following query identifies all nodes that are subsumed by a disease concept and are not caused by a particular bacteria or any of that bacteria's subtypes, as follows:
MATCH p = (a:objectconcept{disease})<-[:isa*]-(b:objectconcept),
      q = (c:objectconcept{bacteria})<-[:isa*]-(d:objectconcept)
WHERE NOT (b)-->()--(c) AND NOT (b)-->()-->(d)
RETURN DISTINCT b.sctid, b.fsn
This query runs in < 1 second and returns the correct answers. However, adding one additional parameter adds substantial time (20 minutes). For example:
MATCH p = (a:objectconcept{disease})<-[:isa*]-(b:objectconcept),
      q = (c:objectconcept{bacteria})<-[:isa*]-(d:objectconcept),
      t = (e:objectconcept{bacteria})<-[:isa*]-(f:objectconcept)
WHERE NOT (b)-->()--(c) AND NOT (b)-->()-->(d) AND NOT (b)-->()-->(e) AND NOT (b)-->()-->(f)
RETURN DISTINCT b.sctid, b.fsn
I am new to Cypher coding, and I have to imagine there is a better way to write this query so that it is more efficient. How would collections improve this?
Thanks
I answered this on the Google Group:
Hi Scott,
I presume you created indexes or constraints for :objectconcept(name)?
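If no index exists yet, something along these lines would create one. This is only a sketch: it assumes the label is objectconcept and that the lookups use a name property, as in the queries in this thread, and it uses the Neo4j 2.x schema syntax. Run one statement or the other, not both.

```cypher
// Option 1: a plain index on the lookup property.
// Assumes label objectconcept and property name, as used in this thread.
CREATE INDEX ON :objectconcept(name);

// Option 2 (instead of option 1): a uniqueness constraint, which is also
// backed by an index. This assumes every concept name is unique, which the
// thread does not confirm.
CREATE CONSTRAINT ON (c:objectconcept) ASSERT c.name IS UNIQUE;
```

With either in place, the start nodes (pneumonia, strep, staph) can be found by index lookup rather than by scanning all 2 million nodes.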
I am working with an acyclic, directed graph (an ontology) that models human health, and I need to identify those diseases (example: pneumonia) that are infectious but not caused by certain bacteria (staph or streptococcus). All concepts are nodes defined as objectconcepts. Objectconcepts are connected by relationships such as [isa], [pathological_process], [causative_agent], etc.
The query requires:
a) Identification of all concepts subsumed by the concept pneumonia, as follows:
MATCH p = (a:objectconcept{pneumonia})<-[:isa*]-(b:objectconcept)
This returns a number of paths, potentially millions. You can check with:
MATCH p = (a:objectconcept{pneumonia})<-[:isa*]-(b:objectconcept) RETURN count(*)
b) Identification of all concepts subsumed by the genus staph and the genus strep (including the concepts genus staph and genus strep themselves), as follows:
WITH b
MATCH (b),
      q = (c:objectconcept{strep})<-[:isa*]-(d:objectconcept),
      h = (e:objectconcept{staph})<-[:isa*]-(f:objectconcept)
This is the cross product of the paths "p", "q" and "h". For example, if each of the three returns 1000 paths, you're at 1000 x 1000 x 1000 = 1bn paths!
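To see how large each factor of that cross product is, each pattern can be counted on its own. The sketch below assumes the concept is matched through a name property with the value 'pneumonia'; both the property key and the value are placeholders for whatever the real data uses.

```cypher
// Count the paths and the distinct concepts below pneumonia.
// In a DAG one concept can be reached by several isa paths,
// so paths can be much larger than concepts.
MATCH (a:objectconcept {name: 'pneumonia'})<-[:isa*]-(b:objectconcept)
RETURN count(*) AS paths, count(DISTINCT b) AS concepts;
```

If paths turns out to be far larger than concepts, collapsing the rows with WITH DISTINCT b before the next MATCH should shrink the intermediate result, since every later pattern is then multiplied by the number of concepts instead of the number of paths.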
c) Identify the nodes(p) that do not have a causative agent of strep (i.e., nodes(q)) or staph (nodes(h)), as follows:
WITH b,c,d,e,f
MATCH (b),(c),(d),(e),(f)
WHERE (b)--()-->(c) OR (b)-->()-->(d) OR (b)-->()-->(e) OR (b)-->()-->(f)
RETURN DISTINCT b.name;
You don't need the WITH or the MATCH (b),(c),(d),(e),(f).
What connections are there between b and the other nodes? Do you have concrete ones? In the first pattern, one direction is missing.
The WHERE clause can be a problem. In general, I want to show that this query is perhaps better expressed as a union of simpler matches,
e.g.
MATCH (a:objectconcept{pneumonia})<-[:isa*]-(b:objectconcept)-->()-->(c:objectconcept{name:strep})
RETURN b.name
UNION
MATCH (a:objectconcept{pneumonia})<-[:isa*]-(b:objectconcept)-->()-->(e:objectconcept{name:staph})
RETURN b.name
UNION
MATCH (a:objectconcept{pneumonia})<-[:isa*]-(b:objectconcept)-->()-->(d:objectconcept)-[:isa*]->(c:objectconcept{name:strep})
RETURN b.name
UNION
MATCH (a:objectconcept{pneumonia})<-[:isa*]-(b:objectconcept)-->()-->(d:objectconcept)-[:isa*]->(c:objectconcept{name:staph})
RETURN b.name
Another option would be to utilize the shortestPath() function to find one or all shortest path(s) between pneumonia and the bacteria, with certain rel-types and a direction.
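A rough sketch of what that could look like follows. The name property and its values, the choice of the isa and causative_agent relationship types, the direction, and the 10-hop limit are all assumptions made for illustration and would need adjusting to the real model.

```cypher
// Find one shortest path from a disease concept to a bacteria concept,
// following only isa and causative_agent relationships in the outgoing
// direction, at most 10 hops (an arbitrary bound).
MATCH (b:objectconcept {name: 'pneumonia'}),
      (c:objectconcept {name: 'strep'}),
      p = shortestPath((b)-[:isa|causative_agent*..10]->(c))
RETURN p;
```

Swapping shortestPath for allShortestPaths returns every path of minimal length instead of a single one.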
Perhaps you can share your dataset and the expected result.