R - Using full name and maiden name strings (and birth years) to match individuals across time


I've got a set of 20 or so consecutive individual-level cross-sectional data sets which I would like to link together.

Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, and for year of birth. This should allow for a pretty high (90-95%) match rate, I presume.

Ideally, I would create a time-independent ID for each unique individual.

I can do this pretty easily in R for those whose marital status (maiden name) does not change: stack the data sets into a long panel, then do something to the effect of:

unique(dt, by = c("first_name", "last_name", "birth_year"))[, id := .I]

(I'm of course using R's data.table), and then merging back to the full data.
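As a minimal sketch of that stack-then-merge idea, the snippet below runs it on a three-row toy panel. The values in `dt` and the helper names `keys` and `ids` are made up for illustration and are not from the real data.

```r
library(data.table)

# Toy stacked panel (made-up values, not the real data)
dt <- data.table(
  first_name = c("eileen", "eileen", "sarah"),
  last_name  = c("aald", "aald", "aax"),
  birth_year = c(1977L, 1977L, 1974L),
  year       = c(2002L, 2003L, 2003L)
)
keys <- c("first_name", "last_name", "birth_year")

# One row per distinct key combination, numbered in order of appearance
ids <- unique(dt, by = keys)[, id := .I]

# Merge the ids back onto every row of the panel (update join)
dt[ids, on = keys, id := i.id]
print(dt)
```

Both "eileen" rows end up sharing one id, which is all this step needs when names never change.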

However, I'm stuck on how to incorporate the maiden name into this procedure. Any suggestions?
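One possible direction, offered only as a sketch: after an exact match on the current last name, retry the leftover rows by comparing their maiden name to the last name seen in the earlier data. The tables `prior` and `current`, their values, and the assumption that `nee` holds the pre-marriage last name are all illustrative.

```r
library(data.table)

# Made-up records: "eileen" changes her last name between the two years
prior <- data.table(
  first_name = c("eileen", "sarah"),
  last_name  = c("dxx", "aax"),
  birth_year = c(1977L, 1974L),
  id         = 1:2
)
current <- data.table(
  first_name = c("eileen", "sarah"),
  last_name  = c("aald", "aax"),
  nee        = c("dxx", "gex"),
  birth_year = c(1977L, 1974L)
)

# Pass 1: exact match on first name, current last name, and birth year
current[prior, on = c("first_name", "last_name", "birth_year"), id := i.id]

# Pass 2: for rows still without an id, match the maiden name (nee)
#   against the last name recorded in the earlier data
idx <- current[, which(is.na(id))]
maiden_match <- prior[current[idx],
                      on = c(first_name = "first_name",
                             last_name  = "nee",
                             birth_year = "birth_year"),
                      x.id]
set(current, i = idx, j = "id", value = maiden_match)
print(current)
```

This assumes each key combination is unique in `prior`; with duplicates the join would return more rows than `idx` and the assignment would fail, so ambiguous keys need to be dropped first.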

Here's a preview of the data:

       first_name     last_name       nee birth_year year
    1:     eileen      aaldxxxx     dxxxx       1977 2002
    2:     eileen      aaldxxxx     dxxxx       1977 2003
    3:      sarah        aaxxxx    gexxxx       1974 2003
    4:      kelly        aaxxxx     nxxxx       1951 2008
    5:      linda aarxxxx-gxxxx   aarxxxx       1967 2008
   ---
72008:     stacey      zwirxxxx   kruxxxx       1982 2010
72009:     stacey      zwirxxxx   kruxxxx       1982 2011
72010:     stacey      zwirxxxx   kruxxxx       1982 2012
72011:     stacey      zwirxxxx   kruxxxx       1982 2013
72012:       jill      zydoxxxx gundexxxx       1978 2002

Update:

I've done a lot of chipping and hammering at this problem; here's what I've got so far. I'd appreciate any comments on possible improvements to the code.

I'm still missing about 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"). I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
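For the stragglers, one candidate (a sketch, not something tested on this data) is base R's `adist`, which computes a generalized Levenshtein edit distance between strings. The helper `closest_name` and its `max_dist` cutoff are hypothetical names introduced here.

```r
# Hypothetical helper: for each straggler, return the closest known name
#   if it is within max_dist single-character edits, otherwise NA
closest_name <- function(stragglers, known, max_dist = 1L) {
  d <- adist(stragglers, known)      # rows: stragglers, columns: known names
  best <- apply(d, 1L, which.min)    # column index of the nearest known name
  best_dist <- d[cbind(seq_along(stragglers), best)]
  ifelse(best_dist <= max_dist, known[best], NA_character_)
}

closest_name(c("tonya", "jenifer", "kelly"),
             c("tanya", "jennifer", "linda"))
```

Here "tonya" and "jenifer" are each one edit away from a known name, while "kelly" has no close candidate and comes back as NA. In practice it would probably be safer to compare only within the same last name and birth year, since a one-letter tolerance across the whole panel invites false matches.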

The basic approach is to build cumulatively: assign IDs in the first year, then look for matches in the second year, and assign new IDs to the unmatched. Then for year 3, look back at the first 2 years, etc. As for how to match, the idea is to slowly expand the matching criteria, since the more robust the match, the lower the chances of mismatching accidentally (I'm particularly worried about the John Smiths).
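Stripped down to a single exact-match pass per year, the cumulative idea might look like the sketch below. The toy `panel`, its values, and the use of one fixed key are simplifying assumptions; the real procedure runs several progressively looser passes.

```r
library(data.table)

# Toy panel (made-up values); teacher_id starts out missing everywhere
panel <- data.table(
  first_name = c("eileen", "sarah", "eileen", "kelly"),
  last_name  = c("aald", "aax", "aald", "aax"),
  birth_year = c(1977L, 1974L, 1977L, 1951L),
  year       = c(2002L, 2002L, 2003L, 2003L),
  teacher_id = NA_integer_
)
keys  <- c("first_name", "last_name", "birth_year")
years <- sort(unique(panel$year))

# First year: every distinct key gets a fresh id
panel[year == years[1L], teacher_id := .GRP, by = keys]

for (yy in years[-1L]) {
  # Ids from all earlier years, keeping only keys that identify one person
  prior <- unique(panel[year < yy & !is.na(teacher_id),
                        c(keys, "teacher_id"), with = FALSE])
  prior <- prior[, if (.N == 1L) .SD, by = keys]

  # Carry ids forward to this year's rows by exact key match (NA if no match)
  idx <- panel[, which(year == yy)]
  set(panel, i = idx, j = "teacher_id",
      value = prior[panel[idx], on = keys, x.teacher_id])

  # Whoever is still unmatched gets a new id above the current maximum
  current_max <- panel[, max(teacher_id, na.rm = TRUE)]
  panel[year == yy & is.na(teacher_id),
        teacher_id := .GRP + current_max, by = keys]
}
print(panel)
```

Dropping ambiguous keys before the join is what guards against the John Smith problem at this level: two earlier people with the same key are never used to assign an id.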

Without further ado, here's the main function for matching a pair of data sets:

# full_data (keyed by year), flags and update_cols are globals defined elsewhere
get_id <- function(yr, key_from, key_to = key_from,
                   mdis, msch, mard, init, mexp, step) {
  # Want to exclude those already matched
  existing_ids <- full_data[.(yr), unique(na.omit(teacher_id))]
  # Get the most recent prior observation of all
  #   unmatched teachers, excluding those teachers who
  #   cannot be uniquely identified by the
  #   current key setting
  unmatched <-
    full_data[.(1996:(yr - 1))
              ][!teacher_id %in% existing_ids,
                .SD[.N], by = teacher_id,
                .SDcols = c(key_from, "teacher_id")
                ][, if (.N == 1L) .SD, keyby = key_from
                  ][, (flags) := list(mdis, msch, mard, init, mexp, step)]
  # Merge, reset keys
  setkey(setkeyv(
    full_data, key_to)[year == yr & is.na(teacher_id),
                       (update_cols) := unmatched[.SD, update_cols, with = FALSE]],
    year)
  full_data[.(yr), (update_cols) := lapply(.SD, function(x) na.omit(x)[1]),
            by = id, .SDcols = update_cols]
}

I then go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:

get_id(yy, c("first_name_clean", "last_name_clean", "birth_year"),
       mdis = TRUE, msch = TRUE, mard = FALSE, init = FALSE, mexp = FALSE, step = 3L)

The final step is to assign new IDs:

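# Highest id in use among this year's matched rows
current_max <- full_data[.(yy), max(teacher_id, na.rm = TRUE)]
# Number the still-unmatched individuals upward from current_max
new_ids <-
  setkey(full_data[year == yy & is.na(teacher_id), .(id = unique(id))
                   ][, add_id := .I + current_max], id)
# Join the new ids back on, then restore the year key
setkey(setkey(full_data, id)[year == yy & is.na(teacher_id),
                             teacher_id := new_ids[.SD, add_id]], year)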

