Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacing values in one DF with values from another DF by matching #37

Open
mattnuttall00 opened this issue Jan 27, 2020 · 5 comments
Open
Labels
data wrangling question Further information is requested

Comments

@mattnuttall00
Copy link

Hi all,

What I am trying to do is quite simple, and I thought I knew how to do it, but it's not working...!

I have two dataframes. Below are reproducible dataframes which demonstrate the example:

df1 <- data.frame(year = rep(c(2007,2008,2009,2010,2011), each=2), comm_name = c("a","b",NA,"d","e",NA,"g","h",NA,"j"), code_com = c(1:10))

df2 <- data.frame(comCode = c(1:10), comm_name = letters[1:10])

All I am trying to do is to replace the NAs in df1$comm_name with the value from df2$comm_name, based on the com_code. i.e. code_com in df1 and comCode in df2 are the same variable, so I want to use those values to match the the correct comm_name from df2 to df1.

I have tried:
df1$comm_name[is.na(df1$comm_name)] <df2$comm_name[match(df1$code_com,df2$comCode)]

and

df1$comm_name <- ifelse(is.na(df1$comm_name), df2$comm_name[match(df2$comCode,df1$code_com)], df1$comm_name)

and

df1$comm_name <- replace(df1$comm_name, is.na(df1$comm_name), df2$comm_name[match(df1$code_com,df2$comCode)])

and

df1 <- df1 %>% mutate(comm_name = ifelse(is.na(df1$comm_name), df2$comm_name[match(df1$code_com,df2$comCode)], comm_name))

and

df1 <- df1 %>% mutate(comm_name = ifelse(is.na(df1$comm_name), replace(comm_name, is.na(comm_name), df2$comm_name[match(df1$code_com,df2$comCode)]), comm_name))

And none of them are doing what I want. Not sure if I am over-complicating it, or if I am just making some silly mistake in my code....but all suggestions welcome!

Thanks

@mattnuttall00 mattnuttall00 added question Further information is requested data wrangling labels Jan 27, 2020
@jejoenje
Copy link
Member

Hi Matt,

I think this does want you want it to do?

df1 <- data.frame(year = rep(c(2007,2008,2009,2010,2011), each=2), 
                  comm_name = c("a","b",NA,"d","e",NA,"g","h",NA,"j"), 
                  code_com = c(1:10))

df2 <- data.frame(comCode = c(1:10), comm_name = letters[1:10])

df1$comm_name = df2$comm_name[match(df1$code_com, df2$comCode)]
df1

Basically, match() returns indexes where it found a match between vector A and B, so you need to use the output of that function as an "indexer" for where you want to place the values.
It's a bit odd to get your head around, but have a look at what the output of match(df1$code_com, df2$comCode) is.

Hope that helps...

J

@mattnuttall00
Copy link
Author

Thanks @jejoenje. Am I correct in thinking though that
df1$comm_name = df2$comm_name[match(df1$code_com, df2$comCode)]
replaces all of the df1$comm_name values with the values from df2$comm_name? How would I then specify that I only wanted it do do that when df1$comm_name values are NA?

@jejoenje
Copy link
Member

jejoenje commented Jan 27, 2020

Ah, yes, it does indeed replace all of them. In the example case, though, that doesn't matter, as your "index" (i.e. df2) has all the necessary values. I assume that for your "real" case, that isn't the case (perhaps only the missing cases)?

The only way I could think of doing this in that case, would be to effectively subset df1 into two, one where df1$comm_Name == NA, and one where df1$comm_Name != NA.
Then do the above match using the former subset only, and re-combine the two subsets. Watch your factor levels (if applicable) in that case, though.

The result, though, for the given example would be identical... it would be far easier to just ensure that your "index" (df2) just has all the cases you're trying to match?
... or is my cold-addled brain missing something?

All else fails: for loop and some ifelse statements. But that would, of course, be cheating. :)

@anna-deasey
Copy link

I think using a combination of left_join() and coalesce() would do what you want - you get two versions of the col 'comm_name' with the join, then coalesce (merge them together) to get a new col with NA's replaced, and then delete the extra cols you don't need. It didn't work with factors though. Does that help?

df1$comm_name <- as.character(df1$comm_name)
df2$comm_name <- as.character(df2$comm_name)

df3 <- left_join(df1, df2, by = c('code_com' = 'comCode')) %>%
mutate(comm_name = coalesce(comm_name.x, comm_name.y)) %>%
select(-comm_name.x, -comm_name.y)

@mattnuttall00
Copy link
Author

Hey @jejoenje and @anna-deasey - both of those methods work. Thanks very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data wrangling question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants