parseName {Ecdat} | R Documentation |
Identify the presumed surname in a character string assumed to represent a name and return the result in a character matrix with "surname" followed by "givenName".
parseName(x, surnameFirst=(median(regexpr(',', x))>0), suffix=c('Jr.', 'I', 'II', 'III', 'IV', 'Sr.'), fixNonStandard=subNonStandardNames, ...)
x |
a character vector |
surnameFirst |
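logical: If TRUE, the surname comes first, followed by a comma
(","), then the given name. If FALSE, parse the surname from a
standard Western "John Smith, Jr." format. If surnameFirst is not
supplied, the default shown in the usage, median(regexpr(',', x)) > 0, is used. |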
suffix |
character vector of strings that are NOT surnames but might appear at the end of a name without the comma that would otherwise identify them as a suffix. |
fixNonStandard |
function to look for and repair nonstandard names such as names
containing characters with accent marks that are sometimes mangled
by different software. Use |
... |
optional arguments passed to |
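The default for surnameFirst in the usage can be explored on its own. The sketch below assumes only base R. The helper name guessSurnameFirst and the two sample vectors are hypothetical and are not part of Ecdat. regexpr() returns -1 for an element with no match, so the median is positive only when a comma appears in most of the elements.

```r
# Hypothetical helper (not part of Ecdat) that mirrors the default
# expression for surnameFirst shown in the usage.
guessSurnameFirst <- function(x) {
  # Position of the first comma in each element, or -1 if there is none.
  pos <- as.integer(regexpr(",", x, fixed = TRUE))
  median(pos) > 0
}

# Mostly "given surname" order: only one element contains a comma.
western <- c("Joe Smith", "John Brown, Jr.", "Linda Rosa Smith-Johnson")
# "surname, given" order: every element contains a comma.
inverted <- c("Smith, Joe", "Brown, John, Jr.", "Smith-Johnson, Linda Rosa")

guessSurnameFirst(western)
guessSurnameFirst(inverted)
```

Because the decision rests on a median, a few names with a comma before a suffix (such as "John Brown, Jr.") do not by themselves switch the whole vector to surname-first parsing.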
If surnameFirst is FALSE:
1. If the last character is ")" and the matching "(" is 3 characters earlier, drop the parenthesized text. Thus, "John Smith (AL)" becomes "John Smith".
2. Look for commas to identify a suffix like Jr. or III; remove and call the rest x2.
3. split <- strsplit(x2, " ")
4. Take the last element of split as the surname.
5. If the "surname" found in step 4 is in suffix, save it to append to the givenName and recurse to get the actual surname.
NOTE: This gives the wrong answer with double surnames written without a hyphen in the Spanish tradition. For example, in "Anastasio Somoza Debayle", "Somoza" and "Debayle" are the (first) surnames of Anastasio's father and mother, respectively. The current algorithm returns "Debayle" as the surname, which is incorrect.
6. Recompose the rest with any suffix as the givenName.
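The steps above can be sketched in base R for a single name. This is a minimal illustration under stated assumptions, not the code used by parseName: the function name parseOneName is hypothetical, the fixNonStandard step is skipped, the recursion of step 5 is written as a loop, and everything after the first comma is treated as a suffix.

```r
# Hypothetical helper, NOT the Ecdat implementation: a base-R sketch of the
# surnameFirst = FALSE steps, applied to one name at a time.
parseOneName <- function(x, suffix = c("Jr.", "I", "II", "III", "IV", "Sr.")) {
  # Step 1: drop a trailing parenthesized 2-character code such as " (AL)".
  x <- sub(" *\\(..\\)$", "", x)
  # Step 2: text after the first comma is kept as a suffix; the rest is x2.
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  x2 <- trimws(parts[1])
  sfx <- trimws(parts[-1])
  # Step 3: split x2 on blanks.
  words <- strsplit(x2, " +")[[1]]
  # Steps 4-5: the last word is the surname unless it is a known suffix,
  # in which case it is set aside and the preceding word is examined.
  while (length(words) > 1 && words[length(words)] %in% suffix) {
    sfx <- c(words[length(words)], sfx)
    words <- words[-length(words)]
  }
  surname <- words[length(words)]
  # Step 6: recompose the remaining words, plus any suffix, as the given name.
  given <- paste(words[-length(words)], collapse = " ")
  if (length(sfx) > 0) given <- paste(c(given, sfx), collapse = ", ")
  c(surname = surname, givenName = given)
}

# Stack the per-name results into a two-column character matrix.
tstSketch <- c("Joe Smith (AL)", "John Brown Jr.", "John Q. Brown,I",
               "Anastasio Somoza Debayle")
do.call(rbind, lapply(tstSketch, parseOneName))
```

The sketch shares the limitation described in the NOTE: for "Anastasio Somoza Debayle" it takes only the last word as the surname.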
a character matrix with two columns: surname and givenName
Spencer Graves
##
## 1.  Parse standard first-last name format
##
tst <- c('Joe Smith (AL)', 'Teresa Angelica Sanchez de Gomez',
         'John Brown, Jr.', 'John Brown Jr.',
         'John W. Brown III', 'John Q. Brown,I',
         'Linda Rosa Smith-Johnson', 'Anastasio Somoza Debayle',
         'Ra_l Vel_zquez')
parsed <- parseName(tst)
tst2 <- matrix(c('Smith', 'Joe', 'Gomez', 'Teresa Angelica Sanchez de',
                 'Brown', 'John, Jr.', 'Brown', 'John, Jr.',
                 'Brown', 'John W., III', 'Brown', 'John Q., I',
                 'Smith-Johnson', 'Linda Rosa', 'Debayle', 'Anastasio Somoza',
                 'Velazquez', 'Raul'),
               ncol=2, byrow=TRUE)
# NOTE:  This second to last example is in the Spanish tradition
# and is handled incorrectly by the current algorithm.
# The correct answer should be "Somoza Debayle", "Anastasio".
# However, fixing that would complicate the algorithm excessively for now.
colnames(tst2) <- c("surname", 'givenName')

all.equal(parsed, tst2)

##
## 2.  Parse "surname, given name" format
##
tst3 <- c('Smith (AL), Joe', 'Sanchez de Gomez, Teresa Angelica',
          'Brown,John, Jr.', 'Brown, John W., III', 'Brown, John Q., I',
          'Smith-Johnson, Linda Rosa', 'Somoza Debayle, Anastasio',
          'Vel_zquez, Ra_l')
tst4 <- parseName(tst3)
tst5 <- matrix(c('Smith', 'Joe', 'Sanchez de Gomez', 'Teresa Angelica',
                 'Brown', 'John, Jr.', 'Brown', 'John W., III',
                 'Brown', 'John Q., I',
                 'Smith-Johnson', 'Linda Rosa', 'Somoza Debayle', 'Anastasio',
                 'Velazquez', 'Raul'),
               ncol=2, byrow=TRUE)
colnames(tst5) <- c("surname", 'givenName')

all.equal(tst4, tst5)