R - Using full name and maiden name strings (and birthdays) to match individuals across time
I've got a set of 20 or so consecutive individual-level cross-sectional data sets that I'd like to link together.
Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, as well as year of birth. This should allow a pretty high (90-95%) match rate, I presume.
Ideally, I would create a time-independent ID for each unique individual.
I can do this pretty easily in R for those whose marital status (maiden name) does not change: stack the data sets into a long panel, then do something to the effect of:
unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]
(I'm of course using R's data.table), then merging this back to the full data.
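For concreteness, here is a minimal sketch of that stack, ID, and merge-back sequence on made-up rows. The table names `y2002`, `y2003`, `ids` and `panel` are hypothetical stand-ins, not objects from my actual data, and the sketch deliberately ignores the maiden name field.

```r
library(data.table)

# Toy stand-ins for two yearly cross-sections (made-up rows, hypothetical names)
y2002 <- data.table(first_name = c("eileen", "jill"),
                    last_name  = c("aaldxxxx", "zydoxxxx"),
                    birth_year = c(1977L, 1978L),
                    year       = 2002L)
y2003 <- data.table(first_name = c("eileen", "sarah"),
                    last_name  = c("aaldxxxx", "aaxxxx"),
                    birth_year = c(1977L, 1974L),
                    year       = 2003L)

# Stack the cross-sections into one long panel
dt <- rbindlist(list(y2002, y2003))

# One row per distinct (first, last, birth year) combination, numbered 1..n
key_cols <- c("first_name", "last_name", "birth_year")
ids <- unique(dt, by = key_cols)[, id := .I][, c(key_cols, "id"), with = FALSE]

# Merge the IDs back onto the full panel
panel <- merge(dt, ids, by = key_cols)
```

Both "eileen" rows end up with the same `id`, while anyone whose `last_name` changes between years would get two different IDs, which is exactly the gap the maiden name needs to close.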
However, I'm stuck on how to incorporate the maiden name into this procedure. Any suggestions?
Here's a preview of the data:
       first_name     last_name       nee birth_year year
    1:     eileen      aaldxxxx     dxxxx       1977 2002
    2:     eileen      aaldxxxx     dxxxx       1977 2003
    3:      sarah        aaxxxx    gexxxx       1974 2003
    4:      kelly        aaxxxx     nxxxx       1951 2008
    5:      linda aarxxxx-gxxxx   aarxxxx       1967 2008
   ---
72008:     stacey      zwirxxxx   kruxxxx       1982 2010
72009:     stacey      zwirxxxx   kruxxxx       1982 2011
72010:     stacey      zwirxxxx   kruxxxx       1982 2012
72011:     stacey      zwirxxxx   kruxxxx       1982 2013
72012:       jill      zydoxxxx gundexxxx       1978 2002
Update:
I've done a lot of chipping and hammering at this problem; here's what I've got so far. I'd appreciate any comments on possible improvements to the code.
I'm still missing 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"); I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
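One possible direction, sketched here only as an assumption and not something I have run on the real data, is to compare straggler first names by edit distance using base R's `utils::adist`. The helper `fuzzy_first_name` and its `max_dist` threshold are hypothetical; in practice it would presumably be applied only within groups that already agree on last name and birth year, to keep false matches down.

```r
# Hypothetical helper: for each unmatched first name, return the closest
# known first name if it is within `max_dist` edits AND that closest
# candidate is unique; otherwise return NA.
fuzzy_first_name <- function(unmatched, known, max_dist = 1L) {
  d <- utils::adist(unmatched, known)   # rows: unmatched, cols: known
  vapply(seq_along(unmatched), function(i) {
    best <- min(d[i, ])
    hits <- known[d[i, ] == best]
    if (best <= max_dist && length(hits) == 1L) hits else NA_character_
  }, character(1))
}

# Made-up example: two near-misses and one name with no close candidate
res <- fuzzy_first_name(c("tanya", "jenifer", "kelly"),
                        c("tonya", "jennifer", "linda"))
res
```

Requiring a unique closest candidate is a conservative choice: an ambiguous straggler is left unmatched instead of being attached to an arbitrary ID.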
The basic approach is to build cumulatively: assign IDs in the first year, then look for matches in the second year and assign new IDs to the unmatched. Then for year 3, look back at the first 2 years, and so on. As for how to match, the idea is to slowly expand the matching criteria, the reasoning being that the more robust the match, the lower the chances of mismatching accidentally (I'm particularly worried about the John Smiths).
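The cumulative idea can be illustrated on a toy panel. This is a simplified sketch under stated assumptions, not my actual code: it uses a single exact-match pass per year where the real procedure runs 12 progressively looser ones, and the column `pid` and the table `known` are hypothetical names.

```r
library(data.table)

# Toy panel (made-up rows); `pid` is a hypothetical person-ID column
panel <- data.table(
  first_name = c("eileen", "jill", "eileen", "sarah", "sarah", "kelly"),
  last_name  = c("aaldxxxx", "zydoxxxx", "aaldxxxx", "aaxxxx", "aaxxxx", "aaxxxx"),
  birth_year = c(1977L, 1978L, 1977L, 1974L, 1974L, 1951L),
  year       = c(2002L, 2002L, 2003L, 2003L, 2004L, 2004L),
  pid        = NA_integer_
)
key_cols <- c("first_name", "last_name", "birth_year")

for (yy in sort(unique(panel$year))) {
  # IDs already assigned in earlier years, one row per (key, pid) pair
  known <- unique(panel[year < yy & !is.na(pid), c(key_cols, "pid"), with = FALSE])
  if (nrow(known) > 0L) {
    # Update join: copy the prior ID onto this year's rows with the same key
    panel[known, on = key_cols,
          pid := fifelse(year == yy & is.na(pid), i.pid, pid)]
  }
  # Anyone still unmatched this year gets a fresh ID above the current maximum
  current_max <- max(c(0L, panel$pid), na.rm = TRUE)
  panel[year == yy & is.na(pid), pid := .GRP + current_max, by = key_cols]
}

panel$pid
```

In the real procedure, the single update join would be replaced by the sequence of progressively looser matching steps, each run only on the rows that are still unmatched.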
Without further ado, here's the main function for matching a pair of data sets:
# full_data (keyed by year), flags and update_cols are defined elsewhere;
#   flags and update_cols are character vectors of column names
get_id<-function(yr,key_from,key_to=key_from,
                 mdis,msch,mard,init,mexp,step){
  # Want to exclude anyone already matched
  existing_ids<-full_data[.(yr),unique(na.omit(teacher_id))]
  # Get the most recent prior observation of
  #   unmatched teachers, excluding teachers who
  #   cannot be uniquely identified under the
  #   current key setting
  unmatched<-
    full_data[.(1996:(yr-1))
              ][!teacher_id %in% existing_ids,
                .SD[.N],by=teacher_id,
                .SDcols=c(key_from,"teacher_id")
                ][,if (.N==1L) .SD,keyby=key_from
                  ][,(flags):=list(mdis,msch,mard,init,mexp,step)]
  # Merge, then reset keys
  setkey(setkeyv(
    full_data,key_to)[year==yr&is.na(teacher_id),
                      (update_cols):=unmatched[.SD,update_cols,with=FALSE]],
    year)
  full_data[.(yr),(update_cols):=lapply(.SD,function(x)na.omit(x)[1]),
            by=id,.SDcols=update_cols]
}
I then go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:
get_id(yy,c("first_name_clean","last_name_clean","birth_year"),
       mdis=TRUE,msch=TRUE,mard=FALSE,init=FALSE,mexp=FALSE,step=3L)
The final step is to assign new IDs:
current_max<-full_data[.(yy),max(teacher_id,na.rm=TRUE)]
new_ids<-
  setkey(full_data[year==yy&is.na(teacher_id),.(id=unique(id))
                   ][,add_id:=.I+current_max],id)
setkey(setkey(full_data,id)[year==yy&is.na(teacher_id),
                            teacher_id:=new_ids[.SD,add_id]],year)