
LAB 04 #8

Open · sunaynagoel opened this issue Feb 20, 2020 · 58 comments
@sunaynagoel

Part 1
#3
I think I have identified how many strings have trailing white spaces. I tried to remove them using trimws(). My question is: once I have removed the white spaces, running my code to find white spaces in the mission field again should return zero or no matches. But that is not the case.

grep(" $", x=dat$mission, value = TRUE, perl = T) %>% head() %>% pander()
grepl( " $", x=dat$mission) %>% sum()

_ _, _ _, _ _, _ _, _ _ and _ _
[1] 3464

trimws(dat$mission, "r")

Even after running that code, these commands

grep(" $", x=dat$mission, value = TRUE, perl = T) %>% head() %>% pander()
grepl( " $", x=dat$mission) %>% sum()

return the same result:

_ _, _ _, _ _, _ _, _ _ and _ _
[1] 3464

Not sure what is going wrong.

@sunaynagoel
Author

@lecy not sure what happened there, but I accidentally opened two LAB 04 issues. I closed one but thought you may want to delete it.

@lecy
Contributor

lecy commented Feb 20, 2020

You are throwing but not catching! You need to assign the trimmed missions back to a new variable. Try:

dat$mission <- trimws( dat$mission, "r")
grepl( " $", x=dat$mission) %>% sum()

Note that you are counting mission statements with a single white space and no text. That is different than "how many strings have trailing white spaces".

You would need to specify [ any text ] [ white space ] [ end of line ].

@sunaynagoel
Author

You are throwing but not catching! You need to assign the trimmed missions back to a new variable. Try:

dat$mission <- trimws( dat$mission, "r")
grepl( " $", x=dat$mission) %>% sum()

Note that you are counting mission statements with a single white space and no text. That is different than "how many strings have trailing white spaces".

You would need to specify [ any text ] [ white space ] [ end of line ].

Thank you, this helps. As far as "strings with trailing spaces" goes, does this work?
"^.+\t\n\r\f$"
I was trying to replace \t\n\r\f with \s but my R is not recognizing it.

@lecy
Contributor

lecy commented Feb 20, 2020

It could be a wildcard, or a selector set. I have not tried this code so this is more pseudocode.

"* $"
"[alphanumeric] ^"

But basically something that says "any letter, number, or punctuation, then a space, then the end of the line."
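
A concrete, untested version of that idea:

# at least one character, then a space, then the end of the line
sum( grepl( ".+ $", dat$mission ) )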

@sunaynagoel
Author

Part II

I have a conceptual question about creating the dictionary. What happens if we don't create a dictionary for our corpus?
Also, just to be clear: this create-dictionary code (which is provided) is trying to find compound words and put them under one header. For example:
non_profit=c("non-profit", "non profit"),
This is asking R to look for "non-profit" or "non profit" and put it under the one category non_profit?

my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
                             united_states = c("united states"),
                             high_school=c("high school"),
                             non_profit=c("non-profit", "non profit"),
                             stem=c("science technology engineering math", 
                                    "science technology engineering mathematics" ),
                             los_angeles=c("los angeles"),
                             ny_state=c("new york state"),
                             ny=c("new york")
                           ))

# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )

@lecy
Contributor

lecy commented Feb 20, 2020

The dictionary simplifies the data by turning these compound words into a single word. It's part of disambiguation.

If you don't apply it, your data is just a little noisier. It depends on the application - if you are very interested in a specific concept in your corpus ("President Bush") you might spend a lot of time making sure you capture all of the variants ("GW", "George W Bush", "Bush Jr", NOT "George HW Bush", etc.).

And correct - the dictionary is mapping all of the phrases on the right to the single term on the left. It is a find-and-replace operation.
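
A tiny before-and-after sketch of that operation (assumes quanteda is loaded; tokens_compound() joins the matched words with an underscore):

library( quanteda )

toks <- tokens( "we are a non profit high school" )
my_dictionary <- dictionary( list( non_profit=c("non profit"),
                                   high_school=c("high school") ) )

tokens_compound( toks, pattern=my_dictionary )
# before: "we" "are" "a" "non" "profit" "high" "school"
# after:  "we" "are" "a" "non_profit" "high_school"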

@jmacost5

I am not understanding how to solve the second part; I am getting confused about how to start it.

@sunaynagoel
Author

I am not understanding how to solve the second part; I am getting confused about how to start it.

Hello @jmacost5, for Part II I started with the code provided in the instructions and skipped the sampling part. Hope this helps.
~Nina

@lecy
Contributor

lecy commented Feb 21, 2020

@jmacost5 I'm going to need more information to answer your question. The instructions are:

Replicate the steps above with the following criteria:

Use the full mission dataset, not the small sample used in the demo.
Add at least ten concepts to your dictionary to convert compound words into single words.
Report the ten most frequently-used words in the mission statements after applying stemming.

Which part is unclear?

@sunaynagoel
Author

sunaynagoel commented Feb 21, 2020

# Challenge Question
@lecy When I try to look inside code01 to get an idea of how to divide it into three better sub-sectors, I find only one value, "A", in all the entries. My questions are:
a. How do I divide into sub-sectors if all the values are identical?
b. Am I reading the question wrong?

@lecy
Contributor

lecy commented Feb 21, 2020

@sunaynagoel

URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
 dat <- readRDS(gzcon(url( URL )))

table( dat$code01 )
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z 
5325 7603  922 2359 1571 1378  699  252  633  417  943  700  828 6488 3683 7782  530  607 2483 2199  295   78 2261 3778  345  614 

@sunaynagoel
Author

@sunaynagoel

URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
 dat <- readRDS(gzcon(url( URL )))

table( dat$code01 )
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z 
5325 7603  922 2359 1571 1378  699  252  633  417  943  700  828 6488 3683 7782  530  607 2483 2199  295   78 2261 3778  345  614 

Thanks. I had to reload the dataset, but now it is showing all the values.

@jmacost5

@jmacost5 I'm going to need more information to answer your question. The instructions are:

Replicate the steps above with the following criteria:
Use the full mission dataset, not the small sample used in the demo.
Add at least ten concepts to your dictionary to convert compound words into single words.
Report the ten most frequently-used words in the mission statements after applying stemming.

Which part is unclear?

I guess the part where we make the compound words into single words. I do not understand how to do that. Do I make a function that removes all of them from the dictionary or just the few that are listed?

@jmacost5

jmacost5 commented Feb 21, 2020

I am not understanding how to solve the second part; I am getting confused about how to start it.

Hello @jmacost5, for Part II I started with the code provided in the instructions and skipped the sampling part. Hope this helps.
~Nina

I am not sure if I am missing something when it comes to the compound words; the code for removing things in the examples is completely different:

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# remove punctuation 
tokens <- tokens( corp, what="word", remove_punct=TRUE )
head( tokens )

@sunaynagoel
Author

@lecy The packages igraph and networkD3 are not available for the R version I have (3.6.1). Is there any way around this?

@lecy
Contributor

lecy commented Feb 21, 2020

@sunaynagoel Please try installing via their GitHub version:

https://github.com/igraph/rigraph

devtools::install_github("gaborcsardi/pkgconfig")
devtools::install_github("igraph/rigraph")

NetworkD3

That's weird about D3. Is it that package, or a required package, that is not available?

You might try to download the Windows binary and install locally (packages >> install from local files)?

https://cran.r-project.org/web/packages/networkD3/index.html
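
If you go the local-install route, the call would look something like this (the file name here is illustrative):

# install from a downloaded binary instead of a repository
install.packages( "networkD3_0.4.zip", repos=NULL, type="win.binary" )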

@lecy
Contributor

lecy commented Feb 21, 2020

@jmacost5

I guess the part where we make the compound words into single words. I do not understand how to do that. Do I make a function that removes all of them from the dictionary or just the few that are listed?

Here is the step where you translate compound words into single words:

my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
                             united_states = c("united states"),
                             high_school=c("high school"),
                             non_profit=c("non-profit", "non profit"),
                             stem=c("science technology engineering math", 
                                    "science technology engineering mathematics" ),
                             los_angeles=c("los angeles"),
                             ny_state=c("new york state"),
                             ny=c("new york")
                           ))

# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )

Your job is to generate n-grams to find phrases that should be combined into single words. That step helps generate options for you to explore, then you would manually translate your selections to the dictionary list. You will add additional phrases or words to the dictionary similar to the examples:

non_profit=c("non-profit", "non profit")

When applied, these multi-word phrases are replaced in the text.
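
A hedged sketch of that browsing step, assuming the tokens object created in the pre-processing steps below:

# list the most common two-word phrases as candidates for the dictionary
two.grams <- tokens_ngrams( tokens, n=2 )
topfeatures( dfm( two.grams ), 25 )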

These are the other pre-processing steps:

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# splits each sentence into a list of words
# remove punctuation first
tokens <- tokens( corp, what="word", remove_punct=TRUE )


# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )

Try: help( tokens_compound ) when quanteda is loaded. It will take you to the documentation files.
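
And a sketch of the final reporting step, after the dictionary is applied (untested):

# stem the tokens, then report the ten most frequent words
tokens <- tokens_wordstem( tokens )
topfeatures( dfm( tokens ), 10 )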

@jmacost5

I wanted to know if there was a way to see the terms that are in the document, or am I missing it from what we previously did? I am trying to identify the terms used for the third part.

@lecy
Contributor

lecy commented Feb 21, 2020

@jmacost5 I'm not sure what you mean by "see the terms that are on the document" ?

We are working with mission statements. After loading the data you can view the mission statements as:

dat$mission

If you want to browse in a spreadsheet view you can type:

View( dat )

Or you could write the data as a CSV file and open it in Excel:

getwd()  # where file will write to
write.csv( dat, "missions.csv" )

@castower

Hello all,

I've run into a dilemma with solving problem one. It's my understanding that adding "^" to a pattern ensures that the pattern begins the sentence. However, the instructions also say to ignore capitalization, and I can't figure out how to get the code to ignore capitalization; it only finds an exact match when ^ is added. Is it alright if I run searches for the different capitalization styles separately and then just sum them?

Thanks!
Courtney

@jrcook15

Hello all,

I've run into a dilemma with solving problem one. It's my understanding that adding "^" to a pattern ensures that the pattern begins the sentence. However, the instructions also say to ignore capitalization, and I can't figure out how to get the code to ignore capitalization; it only finds an exact match when ^ is added. Is it alright if I run searches for the different capitalization styles separately and then just sum them?

Thanks!
Courtney

Hi Courtney,

I tried, "^[Tt]+[Oo] " I believe it worked.

@castower

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

@jrcook15

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

There is a space after the [Oo], before the closing quote; that should eliminate 'Tooele'.

@castower

@jrcook15 ah, okay!! Thank you :)

@jmacost5

@jmacost5 I'm not sure what you mean by "see the terms that are on the document" ?

We are working with mission statements .After loading the data you can view the mission statements as:

dat$mission

If you want to browse in a spreadsheet view you can type:

View( dat )

Or you could write the data as a CSV file and open in excel:

getwd()  # where file will write to
write.csv( dat, "missions.csv" )

I am confused about how to look for terms other than the word "black". I can honestly only think of "African American".

@castower

Hello all,
I'm currently working on trying to remove my trailing whitespaces. I currently have the following code:

dat$mission <- trimws(dat$mission, which = c("right"), whitespace = "* $" )

But I keep getting this error message:

Error in sub(re, "", x, perl = TRUE) : invalid regular expression '* $+$'

I'm not sure how to fix this.

@sunaynagoel
Author

sunaynagoel commented Feb 23, 2020

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

There is a space after the [Oo], before the closing quote; that should eliminate 'Tooele'.

@jrcook15 @castower Using ignore.case=T also works, to make sure to, TO, To, and tO are all considered. Also, instead of using a space after the "o" I used \b (note that R strings need the backslash doubled). It seemed to work for me:
grep( "^to\\b", x=dat$mission, value=TRUE, ignore.case=T )

@sunaynagoel
Author

sunaynagoel commented Feb 23, 2020

Hello all,
I'm currently working on trying to remove my trailing whitespaces. I currently have the following code:

dat$mission <- trimws(dat$mission, which = c("right"), whitespace = "* $" )

But I keep getting this error message:

Error in sub(re, "", x, perl = TRUE) : invalid regular expression '* $+$'

I'm not sure how to fix this.

@castower I was getting the same error as well, but removing the whitespace argument worked for me:
dat$mission <- trimws( dat$mission, "r")

@castower

@sunaynagoel that fixed it! I've been working on this for hours now, lol. Thank you so much!

@lecy
Contributor

lecy commented Feb 23, 2020

@castower @jmacost5 Note that the grep() family of functions contains an ignore.case argument:

grep( pattern, x, ignore.case = FALSE, ... )

This is very clever though!

"^[Tt]+[Oo] "

@lecy
Contributor

lecy commented Feb 23, 2020

@jmacost5

I am confused about how to look for terms other than the word "black". I can honestly only think of "African American".

That is the hard and interesting part of the assignment. One thing you learn quickly when working with text is the usefulness of iteration. We know that "black" is ambiguous (it could be used for a lot of things in mission statements), but "African American" is probably not. So search for missions that contain that term, then look for other key words or phrases.

You just keep adding phrases until the process is not improving outcomes much at all.

You can also google some topics to try and find some words or phrases. If you try "nonprofit + african american" you get:

https://www.huffpost.com/entry/28-organizations-that-are-empowering-black-communities_n_58a730fde4b045cd34c13d9a

This gives you ideas like "black heritage", "black lives", and "women of color". It will just be trial and error: making sure you don't add words that bring in non-matches, and trying not to miss words that add a lot of matches.
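
One rough sketch of that iteration (the keyword list here is just illustrative):

# grow the keyword list, re-run, and watch how the match count changes
keywords <- c( "african american", "black heritage", "black lives", "women of color" )
these <- grepl( paste( keywords, collapse="|" ), dat$mission, ignore.case=TRUE )
sum( these )                    # how many missions match so far
head( dat$mission[ these ] )    # spot-check for false positives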

@sunaynagoel
Author

@sunaynagoel Please try installing via their GitHub version:

https://github.com/igraph/rigraph

devtools::install_github("gaborcsardi/pkgconfig")
devtools::install_github("igraph/rigraph")

NetworkD3

That's weird about D3. Is it that package, or a required package, that is not available?

You might try to download the Windows binary and install locally (packages >> install from local files)?

https://cran.r-project.org/web/packages/networkD3/index.html

This is the package required for the challenge question, to make word networks.

@castower

Hello all,
I'm currently working on summarizing my corpus data and I got the following error message:

nsentence() does not correctly count sentences in all lower-cased text

Is this okay? I still have a table produced, but I don't know if this will cause problems.

Thanks!

@lecy
Contributor

lecy commented Feb 24, 2020

@castower That warning occurs at this step?

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

You can omit this step, and later convert tokens to all lower case:

# convert missions to all lower-case 
dat$mission <- tolower( dat$mission )

# after tokenization before counting terms: 
tokens <- tokens_tolower( tokens, keep_acronyms=TRUE )

But substantively it would not impact much for this lab to leave it in the original order. I suspect your results would not change either way.

@lecy
Contributor

lecy commented Feb 24, 2020

@sunaynagoel Were you able to install either package? You don't need both - they will create similar network diagrams (the D3 version is interactive in an RMD HTML document, that's the only difference).

These are both popular packages, so I would be surprised if neither is working.

@castower

castower commented Feb 24, 2020 via email

@castower

@lecy for some reason,

dat$mission <- tolower( dat$mission )

did not work; however, I changed it to

corp <- tolower( corp )

and it worked fine. It did change my final frequency counts for the top 10 keywords slightly, but not by much.

Thanks!

@sunaynagoel
Author

sunaynagoel commented Feb 24, 2020

@sunaynagoel Were you able to install either package? You don't need both - they will create similar network diagrams (the D3 version is interactive in an RMD HTML document, that's the only difference).

These are both popular packages, so I would be surprised if neither is working.

I was able to download igraph. Thanks

@lecy
Contributor

lecy commented Feb 24, 2020

@castower Ok, great. I'll make a note of that error.

It makes sense why removing capitalization would hinder efforts at automatically identifying sentences. Humans would be pretty good at knowing a sentence had ended, but for computers you might have periods representing abbreviations in the middle of a sentence.

Acme Inc. has good toys online.

So a period followed by lower-case suggests it is mid-sentence. If you remove case that signal would be hard to identify, so you would end up with different splits.

The joys of text analysis!

@castower

@jmacost5 a key term you can use to search for a lot of organizations that serve Black/African American populations is diaspora, or more specifically African/Black diaspora. If you search diaspora generally, you can filter out the organizations that refer to other diasporas. Hope that helps!

@castower

@lecy to clarify, for the challenge question are we examining the subset of the data related to organizations serving Black communities, or the entire database? Thanks!

@jmacost5

I am trying to knit my lab and I keep getting this error even though I have all my packages updated and installed.
[screenshot of the knit error message]

@lecy
Contributor

lecy commented Feb 24, 2020

@castower The challenge questions would use the entire database.

@lecy
Contributor

lecy commented Feb 24, 2020

@jmacost5 Did you include dplyr in your load libraries chunk?

@jmacost5

@jmacost5 Did you include dplyr in your load libraries chunk?

Yes I did, and I included it in my code. I am getting an error about 'corp' now; is there a package that I am missing? I put dpylr, pander, and quantda

@lecy
Contributor

lecy commented Feb 24, 2020

"quanteda" or "quantda" ?

@jmacost5

"quanteda" or "quantda" ?

quanteda

@lecy
Contributor

lecy commented Feb 24, 2020

I would need more to go on to diagnose the problem (you haven't provided a lot of information or your code, so it is a bit of a guessing game). Do you want to send me the RMD file?

@castower

castower commented Feb 24, 2020 via email

@castower

castower commented Feb 25, 2020

Hello,
I've run into an error with part two of the assignment. I'm currently working on trying to create a network for arts and I have the following code:

#tokens.cat2 are the tokens I got from Part 1 of the assignment and I'm reusing them here
arts.token.list <- as.list(tokens.cat2)
arts.token.list <- lapply( arts.token.list, function(x){ x[ ! grepl( "^$", x ) ] } )
arts.token.list[[1]]
listToNet <- function( x )
{
   
   word.pairs <- list()
   
   for( i in 1:length(x) )
   {
      x.i <- x[[i]]
      word.pairs[[i]] <- NULL
      if( length( x.i ) > 1 ) { word.pairs[[i]] <-  data.frame( t( combn( x.i, 2) ) ) }
      if( length( x.i ) > 1 ) { names( word.pairs[[i]] ) <-  c("from","to") }
   }
   
   return( word.pairs )

}

g.list1 <- listToNet( arts.token.list )
head( g.list1[[1]] )
# I created this variable because there was no existing flag for whether an organization is art related, so a 1 is assigned to organizations whose activity code starts with 'Arts' (mirroring the activity code variable we created in the assignment).

 dat$art <- ifelse( grepl( "^art", dat$activity.code, ignore.case = T ), 1, 0) 
table( dat$art, useNA="ifany" )
g.list.1 <- g.list1[ dat$art == 1 ]
m1 <- bind_rows( g.list.1 )
length( g.list.1 )
g.list.2 <- g.list1[ dat$art == 0 ]
m2 <- bind_rows( g.list.2 )
length( g.list.2 )

All the previous code works, but then when I reach this code:

g.art.yes <- graph.edgelist( as.matrix(m1), directed=FALSE )
g.art.no <- graph.edgelist( as.matrix(m2), directed=FALSE )

summary( g.art.yes )
summary( g.art.no )
I get the following error: 
Error in graph.edgelist(as.matrix(m2), directed = FALSE) : graph_from_edgelist expects a matrix with two columns

@lecy

@lecy
Contributor

lecy commented Feb 25, 2020

@castower Can you send me the file by email please?

Note that in the dataset code01 and codedef01 tell you the subsectors if you want to identify them that way:

> head( dat )
        ein                           orgname
1 311767271              NIA PERFORMING ARTS 
2 463091113       THE YOUNG ACTORS GUILD INC 
3 824331000                   RUTH STAGE INC 
4 823821811 STRIPLIGHT COMMUNITY THEATRE INC 
5 911738135       NU BLACK ARTS WEST THEATRE 
6 824668235     OLIVE BRANCH THEATRICALS INC 
                                                                                                                                                                                                                                       mission
1                                                                                                                                         a community based art organization that inspires, nutures,educates and empower artist and community.
2         we engage and educate children in the various aspect of theatrical productions, through acting, directing, and stage crew. we produce community theater productions for children as well as educational theater camps and workshops.
3                                                                                                                                                                                                     theater performances and performing arts
4                                                                                                                                                                                                                                             
5                                                                                                                                                                                                                                             
6 to produce high-quality theater productions for our local community, guiding performers and audience members to a greater appreciation of creativity through the theatrical arts - while leading with respect, organization, accountability.
  code01                     codedef01 code02 codedef02 orgpurposecharitable
1      A Arts, Culture, and Humanities    A65   Theater                    1
2      A Arts, Culture, and Humanities    A65   Theater                    0
3      A Arts, Culture, and Humanities    A65   Theater                    1
4      A Arts, Culture, and Humanities    A65   Theater                    1
5      A Arts, Culture, and Humanities    A65   Theater                    1
6      A Arts, Culture, and Humanities    A65   Theater                    0

@castower

@lecy thanks! I just sent over my RMD file. I will look into using the codes. -Courtney

@lecy
Contributor

lecy commented Feb 25, 2020

@castower Just sent it back. A preview of one of the semantic networks:

[image: semantic network preview]

@castower

Hello all,

So I have been working with the stringr functions a little more and I'm a bit confused about what I'm doing wrong.

I have created the following test data set:

test <- c("hello my name is Courtney")

and I am trying to extract everything after 'hello' so that I can get an output of

my name is Courtney

However, when I run the following:

str_extract_all(test,"(?<=hello )\\S{0,}") 

All that I'm getting is:

[[1]]
[1] "my"

Any tips on what I am doing wrong?

@castower

Also, as a note, I tried str_split instead, and for some reason it deletes 'my':

code:

test <- c("hello my name is Courtney")
str_split(test,"(?<=hello )\\S{0,}") 

Output:

[[1]]
[1] "hello "            " name is Courtney"

@sunaynagoel
Author

str_extract_all(test,"(?<=hello )\S{0,}")

@castower I am not sure what is wrong, but your question was interesting enough for me to try it out on my own.
When I run this code, it eliminates all of the 'h', 'e', 'l', and 'o' characters from the entire string; it is not treating 'hello' as a word. I tried \b as well but it did not work.

str_extract_all(test,"([^hello])") 

[[1]]
[1] " " "m" "y" " " "n" "a" "m" " " "i" "s" " " "C" "u" "r" "t" "n" "y"

But when I try word(), it works:

word(test, 2,-1)

[1] "my name is Courtney"

It would be nice to write a function which can eliminate the first word in every sentence.

@lecy
Contributor

lecy commented Feb 25, 2020

I am going to be honest that when people immediately default to tidyverse packages I feel a little like an old man. Get off my lawn, Hadley Wickham!

I find that tidy packages are really great at scaling operations. Once you know how to do them, then they make it easier and faster to accomplish. They are not always great when you are just learning a new skill in R because they try to be clever and protect you from some of the complicated parts of the code, and they are written in a way that tries to generalize each step at scale. As a result, you lose some of the intuition about what is happening.

For example, group_by( f1, f2 ) %>% mutate( n=n() ) %>% ungroup() is super easy and efficient to write, but behind the scenes the data is being split into many smaller datasets, variables are summarized on subsets, and then everything is recombined in a way that reconciles all of the dimensions correctly. The actual process is not obvious to the neophyte. I used to have to do all of the steps individually, so now I see how great that code is and how much time it saves.
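
A base R sketch of what that pipeline is doing behind the scenes (the data frame d and factors f1 and f2 are hypothetical):

# split-apply-combine by hand
d.split <- split( d, list( d$f1, d$f2 ), drop=TRUE )            # split into subgroups
d.split <- lapply( d.split, function(x){ x$n <- nrow(x); x } )  # count rows per group
d.new   <- do.call( rbind, d.split )                            # recombine (rows come back grouped)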

So let me conclude this soap box by saying it is sometimes helpful to start with core R functions because they tend to operate at the most basic level, and can be helpful for understanding problems.

Your issue here: you want a process to remove the first word from each sentence. My question would be, what is your pseudocode? What do you mean by the first word? Does it have to be "hello", or can it be any word? How do you operationalize the first word?

Try something like this:

> test <- c("hello my name is Courtney")
> 
> # non-generalizable version - just remove hello
> gsub( "^hello ", "", test )
[1] "my name is Courtney"
> 
> # > args( strsplit )
> # function (x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
> # split everything separated by a space into distinct words
> # this is what "tokenization" does
> #
> x.split.list <- strsplit( test, " " )
> x.split.list
[[1]]
[1] "hello"    "my"       "name"     "is"       "Courtney"
>
>
> # extract the vector from the list 
> x.split <- x.split.list[[ 1 ]]
> new.x <- x.split[ -1 ]  # drop first word
> new.x
[1] "my"       "name"     "is"       "Courtney"
> 
> # combine vector elements back into a single string:
> # when you add collapse as an argument to paste it 
> # mashes all elements of a vector into a single string 
> 
> paste0( new.x, collapse=" " )  
[1] "my name is Courtney"

Note that square brackets in regular expressions are not like putting things in quotes: a bracketed set atomizes the characters inside into individual letters rather than isolating the specific word. Compare these two expressions:

gsub( "^hello ", "", test )                   # "^hello " matches the literal word
x.split.list <- strsplit( test, "[hello]" )   # "[hello]" is the character set h, e, l, o

The second would split all of the text by h, e, l, or o and return all of the new atomized strings.
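
On the earlier str_extract_all() question: the lookbehind attempt stopped at "my" because \S matches everything except spaces, so the match ends at the first space. Swapping in a greedy .* should capture the rest of the string (untested sketch):

library( stringr )
test <- c("hello my name is Courtney")
str_extract( test, "(?<=hello ).*" )
# [1] "my name is Courtney"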

@castower

@lecy thanks so much for the detailed response! It really helped me understand what's going on "behind the scenes". I agree, the tidyverse "masks" a lot of the details when I try to follow exactly what is going on. Thanks again!

@sunaynagoel thanks for the word() tip. I had not tried that function yet, but it's very useful!
