LAB 04 #8
@lecy Not sure what happened there, but I accidentally opened two LAB 04 issues. I closed one but thought you may want to delete it. |
You are throwing but not catching! You need to assign the trimmed missions back to a new variable. Try: dat$mission <- trimws( dat$mission, "r")
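A minimal sketch of that point, using a toy vector (the variable x here is hypothetical):

x <- c("arts education  ", "community theater")

# calling trimws() without assignment computes the trimmed
# strings but discards them ("throwing but not catching")
trimws( x, "r" )

# assigning the result back actually keeps the change
x <- trimws( x, "r" )
x
# [1] "arts education"    "community theater"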
grepl( " $", x=dat$mission) %>% sum() Note that you are counting mission statements with a single white space and no text. That is different than "how many strings have trailing white spaces". You would need to specify [ any text ] [ white space ] [ end of line ]. |
Thank you, this helps. As far as "strings with trailing spaces" goes, does this work? |
It could be a wildcard, or a selector set. I have not tried this code so this is more pseudocode:

"* $"
"[alphanumeric] $"

But basically something that says "any letter, number, or punctuation, then a space, then end of line." |
Part II: I have a conceptual question about creating the dictionary. What happens if we don't create a dictionary for our corpus? |
The dictionary simplifies the data by turning these compound words into a single word. It's part of disambiguation. If you don't apply it, your data is just a little noisier. It depends on the application - if you are very interested in a specific concept in your corpus ("President Bush") you might spend a lot of time making sure you capture all of the variants ("GW", "George W Bush", "Bush Jr", NOT "George HW Bush", etc.). And correct - the dictionary is mapping all of the phrases on the right to the single term on the left. It is a find-and-replace operation. |
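As a small sketch of how a dictionary entry captures variants of one concept (toy text; these are the same quanteda functions used later in this thread):

library( quanteda )

# toy text with two variants of the same concept
toks <- tokens( "george w bush met with bush jr supporters" )

my_dict <- dictionary( list( president_bush=c("george w bush", "bush jr") ) )

# tokens_compound() joins each matched phrase into a single token
# ("george_w_bush", "bush_jr"), so the variants stop being split
# across separate words
tokens_compound( toks, pattern=my_dict )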
I am not understanding how to solve the second part; I am confused about how to start it. |
Hello @jmacost5. For Part II, I started with the code provided in the instructions and skipped the sampling part. Hope this helps. |
@jmacost5 I'm going to need more information to answer your question. The instructions are:
Which part is unclear? |
# Challenge Question |
URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"
dat <- readRDS(gzcon(url( URL )))
table( dat$code01 )
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z
5325 7603  922 2359 1571 1378  699  252  633  417  943  700  828 6488 3683 7782  530  607 2483 2199  295   78 2261 3778  345  614 |
Thanks. I had to reload the dataset but now it is showing all the values. |
I guess the part where we make the compound words into single words. I do not understand how to do that. Do I make a function that removes all of them from the dictionary or just the few that are listed? |
I am not understanding if I am missing something when it comes to removing the compound words. In the examples, the code for removing them is completely different:

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# remove punctuation
tokens <- tokens( corp, what="word", remove_punct=TRUE )

head( tokens ) |
@lecy The packages igraph and networkD3 are not available for the R version I have (3.6.1). Is there any way around this? |
@sunaynagoel Please try installing via the GitHub version: https://github.com/igraph/rigraph

devtools::install_github("gaborcsardi/pkgconfig")
devtools::install_github("igraph/rigraph")

That's weird about networkD3. Is it that package, or a required dependency, that is not available? You might try downloading the Windows binary and installing locally (Packages >> Install from local files): https://cran.r-project.org/web/packages/networkD3/index.html |
Here is the step where you translate compound words into single words:

my_dictionary <- dictionary( list( five01_c_3=c("501 c 3","section 501 c 3"),
                   united_states=c("united states"),
                   high_school=c("high school"),
                   non_profit=c("non-profit", "non profit"),
                   stem=c("science technology engineering math",
                          "science technology engineering mathematics" ),
                   los_angeles=c("los angeles"),
                   ny_state=c("new york state"),
                   ny=c("new york")
                 ))

# apply the dictionary to the text
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )

Your job is to generate n-grams to find phrases that should be combined into single words. That step generates options for you to explore; you would then manually translate your selections into the dictionary list. You will add additional phrases or words to the dictionary similar to the examples:

non_profit=c("non-profit", "non profit")

When the dictionary is applied, these multi-word phrases are replaced in the text. These are the other pre-processing steps:

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

# split each sentence into a list of words,
# removing punctuation first
tokens <- tokens( corp, what="word", remove_punct=TRUE )

# apply the dictionary to the text
tokens <- tokens_compound( tokens, pattern=my_dictionary )

Try: |
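A minimal sketch of the n-gram step, assuming quanteda's tokens_ngrams() and the tokens object created above (the cutoff of 25 is an arbitrary choice):

# build all two-word sequences from the tokens,
# then count the most frequent ones
two.grams <- tokens_ngrams( tokens, n=2 )
topfeatures( dfm( two.grams ), 25 )

Frequent bigrams print with an underscore separator (e.g. "high_school"), which makes them easy to translate into dictionary entries like the ones above.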
I wanted to know if there is a way to see the terms that are in the document, or am I missing it from what we previously did? I am trying to identify the terms used for the third part. |
@jmacost5 I'm not sure what you mean by "see the terms that are in the document"? We are working with mission statements. After loading the data you can view the mission statements as:

dat$mission

If you want to browse in a spreadsheet view you can type:

View( dat )

Or you could write the data as a CSV file and open it in Excel:

getwd()  # where the file will write to
write.csv( dat, "missions.csv" ) |
Hello all, I've run into a dilemma with solving problem one. It's my understanding that adding "^" to a pattern ensures that the pattern begins the sentence. However, the instructions also say to ignore capitalization, and I can't figure out how to get the code to ignore capitalization; it only finds an exact match when ^ is added. Is it alright if I run searches for the different capitalization styles separately and then just sum them? Thanks! |
Hi Courtney, I tried "^[Tt]+[Oo] " and I believe it worked. |
@jrcook15 Thank you! That does seem to work, but I can't tell if it excludes values like 'Tooele' that start with 'to'. I currently have spaces following my patterns, such as "^to ", to try to avoid this. I'm probably overthinking it! |
There is a space after the [Oo] before the closing quote; that should eliminate 'Tooele'. |
@jrcook15 ah, okay!! Thank you :) |
I am confused about which terms to look for other than the word "black". I can honestly only think of African American. |
Hello all,
But I keep getting this error message:
I'm not sure how to fix this. |
@jrcook15 @castower Using ignore.case=T also works, to make sure "to", "TO", "To", and "tO" are all considered. Also, instead of using a space after the "o" I used \b. It seemed to work for me. |
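Combining both suggestions, a sketch (untested against the lab data):

# \b is a word boundary, so 'Tooele' will not match;
# ignore.case=TRUE catches To, TO, to, and tO
sum( grepl( "^to\\b", dat$mission, ignore.case=TRUE ) )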
@castower I was getting the same error as well. But removing the criteria worked for me |
@sunaynagoel that fixed it! I've been working on this for hours now, lol. Thank you so much! |
That is the hard and interesting part of the assignment. One thing you learn quickly when working with text is the usefulness of iteration. We know that "black" is ambiguous (it could be used for a lot of things in mission statements), but "African American" is probably not. So search for missions that contain that term, then look for other key words or phrases. You just keep adding phrases until the process is not improving outcomes much at all.

You can also google some topics to try and find some words or phrases. Searching "nonprofit + african american" gives you ideas like "black heritage", "black lives", and "women of color". It will just be trial and error, making sure you don't add words that produce non-matches, and trying not to miss words that would add a lot of matches. |
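One way to operationalize that iteration, as a sketch (the phrase list here is illustrative, not an answer key):

# start from an unambiguous phrase, then grow the list;
# paste(..., collapse="|") builds an OR pattern
key.phrases <- c( "african american", "black heritage",
                  "black lives", "women of color" )
matches <- grepl( paste( key.phrases, collapse="|" ),
                  dat$mission, ignore.case=TRUE )

sum( matches )                   # how many missions match so far
head( dat$mission[ matches ] )   # inspect hits, refine the list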
This is a package required for the challenge question, to make word networks. |
Hello all,
Is this okay? I still have a table produced, but I don't know if this will cause problems. Thanks! |
@castower That warning occurs at this step?

# remove mission statements that are less than 3 sentences long
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )

You can omit this step and later convert tokens to all lower case:

# convert missions to all lower-case
dat$mission <- tolower( dat$mission )

# after tokenization, before counting terms:
tokens <- tokens_tolower( tokens, keep_acronyms=TRUE )

But substantively it would not impact much for this lab to leave it in the original order. I suspect your results would not change either way. |
@sunaynagoel Were you able to install either package? You don't need both - they will create similar network diagrams (the D3 version is interactive in an RMD HTML document, that's the only difference). These are both popular packages, so I would be surprised if neither is working. |
Yes, thank you! |
@lecy for some reason,
did not work. However, I changed it to
and it worked fine. It did change my final frequency counts for the top 10 keywords slightly, but not by much. Thanks! |
I was able to download igraph. Thanks |
@castower Ok, great. I'll make a note of that error. It makes sense why removing capitalization would hinder efforts at automatically identifying sentences. Humans are pretty good at knowing when a sentence has ended, but for computers you might have periods representing abbreviations in the middle of a sentence.
So a period followed by lower-case suggests it is mid-sentence. If you remove case, those boundaries become hard to identify, so you would end up with different splits. The joys of text analysis! |
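A toy illustration of why case matters for splitting sentences (a sketch; the split rule here is just one simple heuristic):

x <- "We serve Mesa, AZ. Our programs are free."

# heuristic: a period, whitespace, then a capital letter marks a boundary
strsplit( x, "(?<=\\.)\\s+(?=[A-Z])", perl=TRUE )

# after tolower() the capital-letter signal is gone,
# so the same rule finds no sentence boundary
strsplit( tolower( x ), "(?<=\\.)\\s+(?=[A-Z])", perl=TRUE )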
@jmacost5 a key term you can use to find a lot of organizations that serve Black/African American populations is diaspora, or more specifically African/Black diaspora. If you search diaspora generally, you can filter out the organizations that refer to other diasporas. Hope that helps! |
@lecy to clarify, for the challenge question are we examining the subset database of data related to the organizations serving Black communities or the entire database? Thanks! |
@castower The challenge questions would use the entire database. |
@jmacost5 Did you include dplyr in your load libraries chunk? |
Yes I did, and I included it in my code. I am getting an error about 'corp' now; is there a package that I am missing? I put dpylr, pander, and quantda.
"quanteda" or "quantda" ? |
quanteda |
I would need more to go on to diagnose the problem (you haven't provided a lot of information or your code, so it is a bit of a guessing game). Do you want to send me the RMD file? |
@lecy, I thought so, but wanted to check! Thank you so much! |
Hello,
All the previous code works, but then when I reach this code:
|
@castower Can you send me the file by email, please? Note that in the dataset, code01 and codedef01 tell you the subsectors if you want to identify them that way:

> head( dat )
ein orgname
1 311767271 NIA PERFORMING ARTS
2 463091113 THE YOUNG ACTORS GUILD INC
3 824331000 RUTH STAGE INC
4 823821811 STRIPLIGHT COMMUNITY THEATRE INC
5 911738135 NU BLACK ARTS WEST THEATRE
6 824668235 OLIVE BRANCH THEATRICALS INC
mission
1 a community based art organization that inspires, nutures,educates and empower artist and community.
2 we engage and educate children in the various aspect of theatrical productions, through acting, directing, and stage crew. we produce community theater productions for children as well as educational theater camps and workshops.
3 theater performances and performing arts
4
5
6 to produce high-quality theater productions for our local community, guiding performers and audience members to a greater appreciation of creativity through the theatrical arts - while leading with respect, organization, accountability.
code01 codedef01 code02 codedef02 orgpurposecharitable
1 A Arts, Culture, and Humanities A65 Theater 1
2 A Arts, Culture, and Humanities A65 Theater 0
3 A Arts, Culture, and Humanities A65 Theater 1
4 A Arts, Culture, and Humanities A65 Theater 1
5 A Arts, Culture, and Humanities A65 Theater 1
6 A Arts, Culture, and Humanities A65 Theater 0
|
@lecy thanks! I just sent over my RMD file. I will look into using the codes. -Courtney |
@castower Just sent it back. A preview of one of the semantic networks: |
Hello all, so I have been working with the stringr functions a little more and I'm a bit confused about what I'm doing wrong. I have created the following test data set:
and I am trying to extract everything after 'hello' so that I can get an output of
However, when I run the following:
All that I'm getting is:
Any tips on what I am doing wrong? |
Also, as a note, I tried str_split instead and for some reason it deletes 'my'. Code:
Output:
|
@castower I am not sure what is wrong, but your question was interesting enough for me to try it out on my own.

[[1]]

But when I try word(), it works:

[1] "my name is Courtney"

It would be nice to write a function which can eliminate the first word in every sentence. |
I am going to be honest: when people immediately default to tidyverse packages I feel a little like an old man. Get off my lawn, Hadley Wickham!

I find that tidy packages are really great at scaling operations. Once you know how to do them, they make things easier and faster to accomplish. They are not always great when you are just learning a new skill in R, because they try to be clever and protect you from some of the complicated parts of the code, and they are written in a way that tries to generalize each step at scale. As a result, you lose some of the intuition about what is happening.

So let me conclude this soap box by saying it is sometimes helpful to start with core R functions, because they tend to operate at the most basic level and can be helpful for understanding problems.

Your issue here: you want a process to remove the first word from each sentence. My question would be, what is your pseudocode? What do you mean by "first word"? Does it have to be "hello", or can it be any word? How do you operationalize the first word?

Try something like this:

> test <- c("hello my name is Courtney")
>
> # non-generalizable version - just remove hello
> gsub( "^hello ", "", test )
[1] "my name is Courtney"
>
> # > args( strsplit )
> # function (x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
> # split everything separated by a space into distinct words
> # this is what "tokenization" does
> #
> x.split.list <- strsplit( test, " " )
> x.split.list
[[1]]
[1] "hello" "my" "name" "is" "Courtney"
>
>
> # extract the vector from the list
> x.split <- x.split.list[[ 1 ]]
> new.x <- x.split[ -1 ] # drop first word
> new.x
[1] "my" "name" "is" "Courtney"
>
> # combine vector elements back into a single string:
> # when you add collapse as an argument to paste it
> # mashes all elements of a vector into a single string
>
> paste0( new.x, collapse=" " )
[1] "my name is Courtney" Note that parentheses in regular expressions are not like putting things in quotes. It actually atomizes the words in the parentheses into individual letters rather than isolating the specific word. So this expression: gsub( "^hello ", "", test )
x.split.list <- strsplit( test, "[hello]" ) Would split all of the text by H, E, L, or O and return all of the new atomized strings. |
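A generalized version of the first-word removal, as a sketch (one regex substitution instead of split-and-paste):

# ^\S+ = the leading run of non-space characters (the first word)
# \s+  = the space(s) after it
drop.first.word <- function( x ) sub( "^\\S+\\s+", "", x )

drop.first.word( "hello my name is Courtney" )
# [1] "my name is Courtney"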
@lecy thanks so much for the detailed response! It really helped me understand what's going on "behind the scenes". I agree, the tidyverse "masks" a lot of the details when I try to follow exactly what is going on. Thanks again! @sunaynagoel thanks for the word() tip. I had not tried that function yet, but it's very useful! |
Part 1
#3
I think I have identified how many strings have trailing white spaces. I tried to remove them using trimws(). My question is, once I have removed the white spaces, if I run my code to find white spaces again in the mission field it should return zero or no matches. But that is not the case:
[1] 3464
Even after running this code, the command returns the same result:
[1] 3464
Not sure what is going wrong.