
LAB 04 #8

Open · sunaynagoel opened this issue Feb 20, 2020 · 58 comments
@sunaynagoel

Part 1
#3
I think I have identified how many strings have trailing white spaces. I tried to remove them using trimws(). My question is: once I have removed the white spaces, running my code to find white spaces in the mission field again should return zero or no matches. But that is not the case.

grep(" $", x=dat$mission, value = TRUE, perl = T) %>% head() %>% pander()
grepl( " $", x=dat$mission) %>% sum()

_ _, _ _, _ _, _ _, _ _ and _ _
[1] 3464

trimws(dat$mission, "r")

Even after running that code, these commands

grep(" $", x=dat$mission, value = TRUE, perl = T) %>% head() %>% pander()
grepl( " $", x=dat$mission) %>% sum()

return the same result:

_ _, _ _, _ _, _ _, _ _ and _ _
[1] 3464

Not sure what is going wrong.

@sunaynagoel
Author

@lecy not sure what happened there, but I accidentally opened two LAB 04 issues. I closed one but thought you may want to delete it.

@lecy
Contributor

lecy commented Feb 20, 2020

You are throwing but not catching! You need to assign the trimmed missions back to a new variable. Try:

dat$mission <- trimws( dat$mission, "r")
grepl( " $", x=dat$mission) %>% sum()

Note that you are counting mission statements with a single white space and no text. That is different than "how many strings have trailing white spaces".

You would need to specify [ any text ] [ white space ] [ end of line ].

@sunaynagoel
Author

You are throwing but not catching! You need to assign the trimmed missions back to a new variable. Try:

dat$mission <- trimws( dat$mission, "r")
grepl( " $", x=dat$mission) %>% sum()

Note that you are counting mission statements with a single white space and no text. That is different than "how many strings have trailing white spaces".

You would need to specify [ any text ] [ white space ] [ end of line ].

Thank you, this helps. As far as "strings with trailing spaces" goes, does this work?
"^.+\t\n\r\f$"
I was trying to replace \t\n\r\f with \s but my R is not recognizing it.

@lecy
Contributor

lecy commented Feb 20, 2020

It could be a wildcard, or a selector set. I have not tried this code so this is more pseudocode.

"* $"
"[alphanumeric] ^"

But basically something that says "any letter, number, or punctuation, then a space, then the end of the line."
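
A concrete, untested version of that idea:

# at least one character, then a space, then the end of the line
sum( grepl( ".+ $", dat$mission ) )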

@sunaynagoel
Author

Part II

I have a conceptual question about creating the dictionary. What happens if we don't create a dictionary for our corpus?
Also, just to be clear: this create-dictionary code (which is provided) is trying to find compound words and put them under one header. For example:
non_profit=c("non-profit", "non profit"),
This is asking R to look for "non-profit" or "non profit" and put it under the one category non_profit?

my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
                             united_states = c("united states"),
                             high_school=c("high school"),
                             non_profit=c("non-profit", "non profit"),
                             stem=c("science technology engineering math", 
                                    "science technology engineering mathematics" ),
                             los_angeles=c("los angeles"),
                             ny_state=c("new york state"),
                             ny=c("new york")
                           ))

# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )

@lecy
Contributor

lecy commented Feb 20, 2020

The dictionary simplifies the data by turning these compound words into a single word. It's part of disambiguation.

If you don't apply it, your data is just a little noisier. It depends on the application - if you are very interested in a specific concept in your corpus ("President Bush") you might spend a lot of time making sure you capture all of the variants ("GW", "George W Bush", "Bush Jr", NOT "George HW Bush", etc.).

And correct - the dictionary is mapping all of the phrases on the right to the single term on the left. It is a find-and-replace operation.
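
A tiny before-and-after sketch of that operation (assumes quanteda is loaded; tokens_compound() joins the matched words with an underscore):

library( quanteda )

toks <- tokens( "we are a non profit high school" )
my_dictionary <- dictionary( list( non_profit=c("non profit"),
                                   high_school=c("high school") ) )

tokens_compound( toks, pattern=my_dictionary )
# before: "we" "are" "a" "non" "profit" "high" "school"
# after:  "we" "are" "a" "non_profit" "high_school"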

@jmacost5

I am not understanding how to solve the second part; I am getting confused about how to start it.

@sunaynagoel
Author

I am not understanding how to solve the second part; I am getting confused about how to start it.

Hello @jmacost5, for Part II I started with the code provided in the instructions and skipped the sampling part. Hope this helps.
~Nina

@lecy
Contributor

lecy commented Feb 21, 2020

@jmacost5 I'm going to need more information to answer your question. The instructions are:

Replicate the steps above with the following criteria:

Use the full mission dataset, not the small sample used in the demo.
Add at least ten concepts to your dictionary to convert compound words into single words.
Report the ten most frequently-used words in the mission statements after applying stemming.

Which part is unclear?

@sunaynagoel
Author

sunaynagoel commented Feb 21, 2020

# Challenge Question
@lecy When I try to look inside code01 to get an idea of how to divide it into three better sub-sectors, I find only one value, "A", in all the entries. My questions are:
a. How do I divide into sub-sectors if all the values are identical?
b. Am I reading the question wrong?

@lecy
Contributor

lecy commented Feb 21, 2020

@sunaynagoel

URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
 dat <- readRDS(gzcon(url( URL )))

table( dat$code01 )
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z 
5325 7603  922 2359 1571 1378  699  252  633  417  943  700  828 6488 3683 7782  530  607 2483 2199  295   78 2261 3778  345  614 

@sunaynagoel
Author

@sunaynagoel

URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
 dat <- readRDS(gzcon(url( URL )))

table( dat$code01 )
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z 
5325 7603  922 2359 1571 1378  699  252  633  417  943  700  828 6488 3683 7782  530  607 2483 2199  295   78 2261 3778  345  614 

Thanks. I had to reload the dataset, but now it is showing all the values.

@jmacost5

@jmacost5 I'm going to need more information to answer your question. The instructions are:

Replicate the steps above with the following criteria:
Use the full mission dataset, not the small sample used in the demo.
Add at least ten concepts to your dictionary to convert compound words into single words.
Report the ten most frequently-used words in the mission statements after applying stemming.

Which part is unclear?

I guess the part where we make the compound words into single words. I do not understand how to do that. Do I make a function that removes all of them from the dictionary or just the few that are listed?

@jmacost5

jmacost5 commented Feb 21, 2020

I am not understanding how to solve the second part; I am getting confused about how to start it.

Hello @jmacost5, for Part II I started with the code provided in the instructions and skipped the sampling part. Hope this helps.
~Nina

I am not sure if I am missing something when it comes to the compound words; the code for removing things in the examples is completely different:

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# remove punctuation 
tokens <- tokens( corp, what="word", remove_punct=TRUE )
head( tokens )

@sunaynagoel
Author

@lecy The packages igraph and networkD3 are not available for the R version I have (3.6.1). Is there any way around this?

@lecy
Contributor

lecy commented Feb 21, 2020

@sunaynagoel Please try installing via their GitHub version:

https://github.com/igraph/rigraph

devtools::install_github("gaborcsardi/pkgconfig")
devtools::install_github("igraph/rigraph")

NetworkD3

That's weird about D3. Is it that package, or a required package, that is not available?

You might try to download the Windows binary and install locally (packages >> install from local files)?

https://cran.r-project.org/web/packages/networkD3/index.html
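
If you go the local-install route, the call would look something like this (the file name here is illustrative):

# install from a downloaded binary instead of a repository
install.packages( "networkD3_0.4.zip", repos=NULL, type="win.binary" )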

@lecy
Contributor

lecy commented Feb 21, 2020

@jmacost5

I guess the part where we make the compound words into single words. I do not understand how to do that. Do I make a function that removes all of them from the dictionary or just the few that are listed?

Here is the step where you translate compound words into single words:

my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
                             united_states = c("united states"),
                             high_school=c("high school"),
                             non_profit=c("non-profit", "non profit"),
                             stem=c("science technology engineering math", 
                                    "science technology engineering mathematics" ),
                             los_angeles=c("los angeles"),
                             ny_state=c("new york state"),
                             ny=c("new york")
                           ))

# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )

Your job is to generate n-grams to find phrases that should be combined into single words. That step helps generate options for you to explore, then you would manually translate your selections to the dictionary list. You will add additional phrases or words to the dictionary similar to the examples:

non_profit=c("non-profit", "non profit")

When applied, these multi-word phrases are replaced in the text.
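
A hedged sketch of that browsing step, assuming the tokens object created in the pre-processing steps below:

# list the most common two-word phrases as candidates for the dictionary
two.grams <- tokens_ngrams( tokens, n=2 )
topfeatures( dfm( two.grams ), 25 )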

These are the other pre-processing steps:

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# splits each sentence into a list of words
# remove punctuation first
tokens <- tokens( corp, what="word", remove_punct=TRUE )


# apply the dictionary to the text 
tokens <- tokens_compound( tokens, pattern=my_dictionary )

Try: help( tokens_compound ) when quanteda is loaded. It will take you to the documentation files.
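
And a sketch of the final reporting step, after the dictionary is applied (untested):

# stem the tokens, then report the ten most frequent words
tokens <- tokens_wordstem( tokens )
topfeatures( dfm( tokens ), 10 )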

@jmacost5

I wanted to know if there was a way to see the terms that are in the document, or am I missing it from what we previously did? I am trying to identify the terms used for the third part.

@lecy
Contributor

lecy commented Feb 21, 2020

@jmacost5 I'm not sure what you mean by "see the terms that are on the document" ?

We are working with mission statements. After loading the data you can view the mission statements as:

dat$mission

If you want to browse in a spreadsheet view you can type:

View( dat )

Or you could write the data as a CSV file and open it in Excel:

getwd()  # where file will write to
write.csv( dat, "missions.csv" )

@castower

Hello all,

I've run into a dilemma with solving problem one. It's my understanding that adding "^" to a pattern ensures that the pattern begins the sentence. However, the instructions also say to ignore capitalization, and I can't figure out how to get the code to ignore capitalization; it only finds an exact match when ^ is added. Is it alright if I run searches for the different capitalization styles separately and then just sum them?

Thanks!
Courtney

@jrcook15

Hello all,

I've run into a dilemma with solving problem one. It's my understanding that adding "^" to a pattern ensures that the pattern begins the sentence. However, the instructions also say to ignore capitalization, and I can't figure out how to get the code to ignore capitalization; it only finds an exact match when ^ is added. Is it alright if I run searches for the different capitalization styles separately and then just sum them?

Thanks!
Courtney

Hi Courtney,

I tried, "^[Tt]+[Oo] " I believe it worked.

@castower

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

@jrcook15

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

There is a space after the [Oo], before the closing quote; that should eliminate 'Tooele'.

@castower

@jrcook15 ah, okay!! Thank you :)

@jmacost5

@jmacost5 I'm not sure what you mean by "see the terms that are on the document" ?

We are working with mission statements .After loading the data you can view the mission statements as:

dat$mission

If you want to browse in a spreadsheet view you can type:

View( dat )

Or you could write the data as a CSV file and open in excel:

getwd()  # where file will write to
write.csv( dat, "missions.csv" )

I am confused about how to look for terms other than the word "black". I can honestly only think of "African American".

@castower

Hello all,
I'm currently working on trying to remove my trailing whitespaces. I currently have the following code:

dat$mission <- trimws(dat$mission, which = c("right"), whitespace = "* $" )

But I keep getting this error message:

Error in sub(re, "", x, perl = TRUE) : invalid regular expression '* $+$'

I'm not sure how to fix this.

@sunaynagoel
Author

sunaynagoel commented Feb 23, 2020

@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it!

There is a space after the [Oo], before the closing quote; that should eliminate 'Tooele'.

@jrcook15 @castower Using ignore.case=T also works, to make sure to, TO, To, and tO are all considered. Also, instead of using a space after the "o" I used \b (note that R strings need the backslash doubled). It seemed to work for me:
grep( "^to\\b", x=dat$mission, value=TRUE, ignore.case=T )

@sunaynagoel
Author

sunaynagoel commented Feb 23, 2020

Hello all,
I'm currently working on trying to remove my trailing whitespaces. I currently have the following code:

dat$mission <- trimws(dat$mission, which = c("right"), whitespace = "* $" )

But I keep getting this error message:

Error in sub(re, "", x, perl = TRUE) : invalid regular expression '* $+$'

I'm not sure how to fix this.

@castower I was getting the same error as well, but removing the whitespace argument worked for me:
dat$mission <- trimws( dat$mission, "r")

@castower

@sunaynagoel that fixed it! I've been working on this for hours now, lol. Thank you so much!

@lecy
Contributor

lecy commented Feb 23, 2020

@castower @jmacost5 Note that the grep() family of functions contains an ignore.case argument:

grep( pattern, x, ignore.case = FALSE, ... )

This is very clever though!

"^[Tt]+[Oo] "

@lecy
Contributor

lecy commented Feb 23, 2020

@jmacost5

I am confused about how to look for terms other than the word "black". I can honestly only think of "African American".

That is the hard and interesting part of the assignment. One thing you learn quickly when working with text is the usefulness of iteration. We know that "black" is ambiguous (it could be used for a lot of things in mission statements), but "African American" is probably not. So search for missions that contain that term, then look for other key words or phrases.

You just keep adding phrases until the process is not improving outcomes much at all.

You can also google some topics to try and find some words or phrases. If you try "nonprofit + african american" you get:

https://www.huffpost.com/entry/28-organizations-that-are-empowering-black-communities_n_58a730fde4b045cd34c13d9a

This gives you ideas like "black heritage", "black lives", and "women of color". It will just be trial and error: making sure you don't add words that bring in non-matches, and trying not to miss words that add a lot of matches.
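
One rough sketch of that iteration (the keyword list here is just illustrative):

# grow the keyword list, re-run, and watch how the match count changes
keywords <- c( "african american", "black heritage", "black lives", "women of color" )
these <- grepl( paste( keywords, collapse="|" ), dat$mission, ignore.case=TRUE )
sum( these )                    # how many missions match so far
head( dat$mission[ these ] )    # spot-check for false positives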

@sunaynagoel
Author

@sunaynagoel Please try installing via their GitHub version:

https://github.com/igraph/rigraph

devtools::install_github("gaborcsardi/pkgconfig")
devtools::install_github("igraph/rigraph")

NetworkD3

That's weird about D3. Is it that package, or a required package, that is not available?

You might try to download the Windows binary and install locally (packages >> install from local files)?

https://cran.r-project.org/web/packages/networkD3/index.html

This is the package required for the challenge question, to make word networks.

@castower

Hello all,
I'm currently working on summarizing my corpus data and I got the following error message:

nsentence() does not correctly count sentences in all lower-cased text

Is this okay? I still have a table produced, but I don't know if this will cause problems.

Thanks!

@lecy
Contributor

lecy commented Feb 24, 2020

@castower That warning occurs at this step?

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

You can omit this step, and later convert tokens to all lower case:

# convert missions to all lower-case 
dat$mission <- tolower( dat$mission )

# after tokenization before counting terms: 
tokens <- tokens_tolower( tokens, keep_acronyms=TRUE )

But substantively it would not impact much for this lab to leave it in the original order. I suspect your results would not change either way.

@lecy
Contributor

lecy commented Feb 24, 2020

@sunaynagoel Were you able to install either package? You don't need both - they will create similar network diagrams (the D3 version is interactive in an RMD HTML document, that's the only difference).

These are both popular packages, so I would be surprised if neither is working.

@castower

castower commented Feb 24, 2020 via email

@castower

@lecy for some reason,

dat$mission <- tolower( dat$mission )

did not work; however, I changed it to

corp <- tolower( corp )

and it worked fine. It did change my final frequency counts for the top 10 keywords slightly, but not by much.

Thanks!

@sunaynagoel
Author

sunaynagoel commented Feb 24, 2020

@sunaynagoel Were you able to install either package? You don't need both - they will create similar network diagrams (the D3 version is interactive in an RMD HTML document, that's the only difference).

These are both popular packages, so I would be surprised if neither is working.

I was able to download igraph. Thanks

@lecy
Contributor

lecy commented Feb 24, 2020

@castower Ok, great. I'll make a note of that error.

It makes sense why removing capitalization would hinder efforts at automatically identifying sentences. Humans would be pretty good at knowing a sentence had ended, but for computers you might have periods representing abbreviations in the middle of a sentence.

Acme Inc. has good toys online.

So a period followed by lower-case suggests it is mid-sentence. If you remove case that signal would be hard to identify, so you would end up with different splits.

The joys of text analysis!

@castower

@jmacost5 a key term you can use to search for a lot of organizations that serve Black/African American populations is diaspora, or more specifically African/Black diaspora. If you search diaspora generally, you can filter out the organizations that refer to other diasporas. Hope that helps!

@castower

@lecy to clarify, for the challenge question are we examining the subset of the data related to organizations serving Black communities, or the entire database? Thanks!

@jmacost5

I am trying to knit my lab and I keep getting this error even though I have all my packages updated and installed.
[screenshot of the knit error message]

@lecy
Contributor

lecy commented Feb 24, 2020

@castower The challenge questions would use the entire database.

@lecy
Contributor

lecy commented Feb 24, 2020

@jmacost5 Did you include dplyr in your load libraries chunk?

@jmacost5

@jmacost5 Did you include dplyr in your load libraries chunk?

Yes I did, and I included it in my code. I am getting an error about 'corp' now; is there a package that I am missing? I put dpylr, pander, and quantda

@lecy
Contributor

lecy commented Feb 24, 2020

"quanteda" or "quantda" ?

@jmacost5

"quanteda" or "quantda" ?

quanteda

@lecy
Contributor

lecy commented Feb 24, 2020

I would need more to go on to diagnose the problem (you haven't provided a lot of information or your code, so it is a bit of a guessing game). Do you want to send me the RMD file?

@castower

castower commented Feb 24, 2020 via email

@castower

castower commented Feb 25, 2020

Hello,
I've run into an error with part two of the assignment. I'm currently working on trying to create a network for arts and I have the following code:

#tokens.cat2 are the tokens I got from Part 1 of the assignment and I'm reusing them here
arts.token.list <- as.list(tokens.cat2)
arts.token.list <- lapply( arts.token.list, function(x){ x[ ! grepl( "^$", x ) ] } )
arts.token.list[[1]]
listToNet <- function( x )
{
   
   word.pairs <- list()
   
   for( i in 1:length(x) )
   {
      x.i <- x[[i]]
      word.pairs[[i]] <- NULL
      if( length( x.i ) > 1 ) { word.pairs[[i]] <-  data.frame( t( combn( x.i, 2) ) ) }
      if( length( x.i ) > 1 ) { names( word.pairs[[i]] ) <-  c("from","to") }
   }
   
   return( word.pairs )

}

g.list1 <- listToNet( arts.token.list )
head( g.list1[[1]] )
# I created this variable because there was no existing flag for whether an organization is art related, so a 1 is assigned to organizations whose activity code starts with 'Arts' (mirroring the activity code variable we created in the assignment).

 dat$art <- ifelse( grepl( "^art", dat$activity.code, ignore.case = T ), 1, 0) 
table( dat$art, useNA="ifany" )
g.list.1 <- g.list1[ dat$art == 1 ]
m1 <- bind_rows( g.list.1 )
length( g.list.1 )
g.list.2 <- g.list1[ dat$art == 0 ]
m2 <- bind_rows( g.list.2 )
length( g.list.2 )

All the previous code works, but then when I reach this code:

g.art.yes <- graph.edgelist( as.matrix(m1), directed=FALSE )
g.art.no <- graph.edgelist( as.matrix(m2), directed=FALSE )

summary( g.art.yes )
summary( g.art.no )
I get the following error: 
Error in graph.edgelist(as.matrix(m2), directed = FALSE) : graph_from_edgelist expects a matrix with two columns

@lecy

@lecy
Contributor

lecy commented Feb 25, 2020

@castower Can you send me the file by email please?

Note that in the dataset code01 and codedef01 tell you the subsectors if you want to identify them that way:

> head( dat )
        ein                           orgname
1 311767271              NIA PERFORMING ARTS 
2 463091113       THE YOUNG ACTORS GUILD INC 
3 824331000                   RUTH STAGE INC 
4 823821811 STRIPLIGHT COMMUNITY THEATRE INC 
5 911738135       NU BLACK ARTS WEST THEATRE 
6 824668235     OLIVE BRANCH THEATRICALS INC 
                                                                                                                                                                                                                                       mission
1                                                                                                                                         a community based art organization that inspires, nutures,educates and empower artist and community.
2         we engage and educate children in the various aspect of theatrical productions, through acting, directing, and stage crew. we produce community theater productions for children as well as educational theater camps and workshops.
3                                                                                                                                                                                                     theater performances and performing arts
4                                                                                                                                                                                                                                             
5                                                                                                                                                                                                                                             
6 to produce high-quality theater productions for our local community, guiding performers and audience members to a greater appreciation of creativity through the theatrical arts - while leading with respect, organization, accountability.
  code01                     codedef01 code02 codedef02 orgpurposecharitable
1      A Arts, Culture, and Humanities    A65   Theater                    1
2      A Arts, Culture, and Humanities    A65   Theater                    0
3      A Arts, Culture, and Humanities    A65   Theater                    1
4      A Arts, Culture, and Humanities    A65   Theater                    1
5      A Arts, Culture, and Humanities    A65   Theater                    1
6      A Arts, Culture, and Humanities    A65   Theater                    0

@castower

@lecy thanks! I just sent over my RMD file. I will look into using the codes. -Courtney

@lecy
Contributor

lecy commented Feb 25, 2020

@castower Just sent it back. A preview of one of the semantic networks:

[image: semantic network preview]

@castower

Hello all,

So I have been working with the stringr functions a little more and I'm a bit confused about what I'm doing wrong.

I have created the following test data set:

test <- c("hello my name is Courtney")

and I am trying to extract everything after 'hello' so that I can get an output of

my name is Courtney

However, when I run the following:

str_extract_all(test,"(?<=hello )\\S{0,}") 

All that I'm getting is:

[[1]]
[1] "my"

Any tips on what I am doing wrong?

@castower

Also, as a note, I tried str_split instead, and for some reason it deletes 'my':

code:

test <- c("hello my name is Courtney")
str_split(test,"(?<=hello )\\S{0,}") 

Output:

[[1]]
[1] "hello "            " name is Courtney"

@sunaynagoel
Author

str_extract_all(test,"(?<=hello )\S{0,}")

@castower I am not sure what is wrong, but your question was interesting enough for me to try it out on my own.
When I run this code, it eliminates all of the 'h', 'e', 'l', and 'o' characters from the entire string; it is not treating 'hello' as a word. I tried \b as well but it did not work.

str_extract_all(test,"([^hello])") 

[[1]]
[1] " " "m" "y" " " "n" "a" "m" " " "i" "s" " " "C" "u" "r" "t" "n" "y"

But when I try word(), it works:

word(test, 2,-1)

[1] "my name is Courtney"

It would be nice to write a function which can eliminate the first word in every sentence.

@lecy
Contributor

lecy commented Feb 25, 2020

I am going to be honest that when people immediately default to tidyverse packages I feel a little like an old man. Get off my lawn, Hadley Wickham!

I find that tidy packages are really great at scaling operations. Once you know how to do them, then they make it easier and faster to accomplish. They are not always great when you are just learning a new skill in R because they try to be clever and protect you from some of the complicated parts of the code, and they are written in a way that tries to generalize each step at scale. As a result, you lose some of the intuition about what is happening.

For example, group_by( f1, f2 ) %>% mutate( n=n() ) %>% ungroup() is super easy and efficient to write, but behind the scenes the data is being split into many smaller datasets, variables are summarized on subsets, and then everything is recombined in a way that reconciles all of the dimensions correctly. The actual process is not obvious to the neophyte. I used to have to do all of the steps individually, so now I see how great that code is and how much time it saves.
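
A base R sketch of what that pipeline is doing behind the scenes (the data frame d and factors f1 and f2 are hypothetical):

# split-apply-combine by hand
d.split <- split( d, list( d$f1, d$f2 ), drop=TRUE )            # split into subgroups
d.split <- lapply( d.split, function(x){ x$n <- nrow(x); x } )  # count rows per group
d.new   <- do.call( rbind, d.split )                            # recombine (rows come back grouped)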

So let me conclude this soap box by saying it is sometimes helpful to start with core R functions because they tend to operate at the most basic level, and can be helpful for understanding problems.

Your issue here: you want a process to remove the first word from each sentence. My question would be, what is your pseudocode? What do you mean by the first word? Does it have to be "hello", or can it be any word? How do you operationalize the first word?

Try something like this:

> test <- c("hello my name is Courtney")
> 
> # non-generalizable version - just remove hello
> gsub( "^hello ", "", test )
[1] "my name is Courtney"
> 
> # > args( strsplit )
> # function (x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
> # split everything separated by a space into distinct words
> # this is what "tokenization" does
> #
> x.split.list <- strsplit( test, " " )
> x.split.list
[[1]]
[1] "hello"    "my"       "name"     "is"       "Courtney"
>
>
> # extract the vector from the list 
> x.split <- x.split.list[[ 1 ]]
> new.x <- x.split[ -1 ]  # drop first word
> new.x
[1] "my"       "name"     "is"       "Courtney"
> 
> # combine vector elements back into a single string:
> # when you add collapse as an argument to paste it 
> # mashes all elements of a vector into a single string 
> 
> paste0( new.x, collapse=" " )  
[1] "my name is Courtney"

Note that square brackets in regular expressions are not like putting things in quotes: a bracketed set atomizes the characters inside into individual letters rather than isolating the specific word. Compare these two expressions:

gsub( "^hello ", "", test )                   # "^hello " matches the literal word
x.split.list <- strsplit( test, "[hello]" )   # "[hello]" is the character set h, e, l, o

The second would split all of the text by h, e, l, or o and return all of the new atomized strings.
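
On the earlier str_extract_all() question: the lookbehind attempt stopped at "my" because \S matches everything except spaces, so the match ends at the first space. Swapping in a greedy .* should capture the rest of the string (untested sketch):

library( stringr )
test <- c("hello my name is Courtney")
str_extract( test, "(?<=hello ).*" )
# [1] "my name is Courtney"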

@castower

@lecy thanks so much for the detailed response! It really helped me understand what's going on "behind the scenes". I agree, the tidyverse "masks" a lot of the details when I try to follow exactly what is going on. Thanks again!

@sunaynagoel thanks for the word() tip. I had not tried that function yet, but it's very useful!
