StatsTools

### Working with Messy Text

Heyo! I am doing my best to procrastinate here on a blustery Tuesday afternoon. So, I decided to share some code I’ve put together that solves problems in R that I used to solve in Perl. HTML or C++ was probably my first real language, but I love the heck out of Perl. It’s never done me wrong (unlike you, PHP).

Anyways! The context of this project is that we are developing a dictionary of words to complement the work done by Jonathan Haidt and Jesse Graham – learn more. I had a student who was interested in Moral Foundations Theory and its relationship to language, and we had tested some of the dictionary and found it to be frustratingly obtuse. Meaning that a lot of the words in it are great, but not things that people like college freshmen, or even me, were likely to say. She’s moved on to working with the founder of the LIWC – and even worked on the newest version of it :small brag:.

Now I have a second student who’s helping finish up some work on the dictionary, to see if what we were doing is worthwhile (spoiler alert: I don’t know). However, I thought I might share some code we were using and its context for people who are also trying to get into doing some of this text mining/cleaning/editing in R. You can find all the materials for this project, including the code in context of our messy paper, on GitHub.

Here’s a view of what the data looks like (this isn’t even the messiest part, and part 2 of our study uses full written paragraphs):

    > head(noout1$Q27)
    [1] "doctors, babysitting"
    [2] "criminals, doctors, shootings, medicine "
    [3] "Health"
    [4] "physical healthiness, mental healthiness"
    [5] "hurt, effect, love, protect"
    [6] "hurt, depression, pain"

So, a couple of things we have to deal with:

• Mixed case
• Punctuation
• Stemming (affixes)

Now, don’t hate on me folks, but I love a good loop. I could probably do this with the apply family, but I didn’t:

    > ##stem the data; library(corpus) was loaded earlier
    > for (i in 1:nrow(noout1)) {
    +   noout1$Q27[i] = paste(unlist(
    +     text_tokens(noout1$Q27[i], stemmer = "en")), collapse = " ")
    + }

Unpacking what this does:

• Loops over each participant’s answers in Q27. I did this because text_tokens returns a list (one vector of tokens per text), which I personally find troublesome to deal with, and I wanted to retain each person’s answers in one cell.
• Uses text_tokens to “tokenize” the data, that is, split it into words and punctuation marks. stemmer = "en" is an argument to stem (de-affix) the words in English.
• Unlists the list returned by text_tokens.
• Pastes the updated data back to one cell. Be sure to use collapse here and not sep, as we want 1 item returned. sep only goes between the separate arguments handed to paste, so with a single vector of tokens it would leave each token as its own string.

    > ##one example
    > paste(unlist(
    +     text_tokens(noout1$Q27[4], stemmer = "en")), collapse = " ")
    [1] "physic healthi , mental healthi" ##one string
    > paste(unlist(
    +     text_tokens(noout1$Q27[4], stemmer = "en")), sep = " ")
    [1] "physic"  "healthi" ","       "mental"  "healthi" ##five strings

Let’s look at the data now:

    > head(noout1$Q27)
    [1] "doctor , babysit"
    [2] "crimin , doctor , shoot , medicin"
    [3] "health"
    [4] "physic healthi , mental healthi"
    [5] "hurt , effect , love , protect"
    [6] "hurt , depress , pain"
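If the collapse versus sep distinction is still fuzzy, here is a tiny base R sketch you can run without any packages. The token vector is copied by hand from the example output, and the strsplit-based cleaner is only a hypothetical stand-in for text_tokens (it does no stemming); it is there to show roughly what an apply-family version of the loop could look like.

```r
# Tokens copied by hand from the example output (no stemming package needed)
tokens <- c("physic", "healthi", ",", "mental", "healthi")

# collapse glues the elements of ONE vector into a single string
one_string <- paste(tokens, collapse = " ")

# sep only separates the ARGUMENTS of paste(); with a single vector argument
# there is nothing to separate, so every element stays its own string
still_five <- paste(tokens, sep = " ")

length(one_string)  # one string
length(still_five)  # five strings

# sep does matter once there are several arguments, pasted element by element
paired <- paste(c("a", "b"), c("x", "y"), sep = "-")

# Hypothetical apply-family version of the loop: vapply returns one cleaned
# string per answer. strsplit() is a crude stand-in for text_tokens().
answers <- c("doctors, babysitting", "Health")
cleaned <- vapply(
  answers,
  function(a) paste(strsplit(tolower(a), ",\\s*")[[1]], collapse = " "),
  character(1),
  USE.NAMES = FALSE
)
```

The vapply call insists that every answer comes back as exactly one string, which is the same guarantee the loop gets from collapse.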

You can see that the words have been stemmed and are now in lower case. We haven’t removed punctuation yet. There are lots of ways to do that, but since one of the next steps does it for me, I won’t cover those. The next step requires the tm library, although I bet the corpus library also does similar steps; I’m just more familiar with tm. We will create a corpus out of the vector of participant answers I have:

    > ##create a corpus
    > harm_corpus = Corpus(VectorSource(noout1$Q27))
    > harm_TDM = as.matrix(TermDocumentMatrix(harm_corpus,
    +                      control = list(removePunctuation = TRUE,
    +                                     stopwords = TRUE)))

The Corpus step simply creates a big list of all the “documents” (here, each participant is treated as a separate document, which is what I want) from a vector, rather than opening separate documents in a file. The TermDocumentMatrix function creates a giant matrix wherein:

• Terms (words) are rows
• Documents (participants) are columns
• Each row-by-column cell stores the number of times that term appeared in that document.

These can get real big, real fast, fyi. The nice thing about the TermDocumentMatrix function is that it handled the punctuation for me by using removePunctuation = TRUE and also dealt with the stop words. Stop words are things like the, an, a, of that are traditionally removed from these types of analyses that focus on content words over helper words.

    > harm_TDM[1:6, 1:6]
             Docs
    Terms     1 2 3 4 5 6
      babysit 1 0 0 0 0 0
      doctor  1 1 0 0 0 0
      crimin  0 1 0 0 0 0
      medicin 0 1 0 0 0 0
      shoot   0 1 0 0 0 0
      health  0 0 1 0 0 0

Great, now what can I do with that? Everything! Here’s what we did. First, we found the most frequent words by creating a data.frame that was a frequency table (thanks StackOverflow!):

    > ##view the most frequent words
    > harm_freq = data.frame(Word = rownames(harm_TDM),
    +                        Freq = rowSums(harm_TDM),
    +                        row.names = NULL)
    > harm_freq$Word = as.character(harm_freq$Word)
    > harm_freq$percent = harm_freq$Freq/nrow(noout1) *100
    > head(harm_freq)
         Word Freq    percent
    1 babysit    1  0.2298851
    2  doctor   52 11.9540230
    3  crimin    6  1.3793103
    4 medicin    5  1.1494253
    5   shoot    1  0.2298851
    6  health   16  3.6781609

Doctor is in the top 5; other big words included hurt, love, pain, and hospit(al). In this prompt, participants were free associating with the harm/care foundation.
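If you want to see what a term-document matrix is made of without installing tm, here is a rough base R sketch on three made-up answers. This is not what tm does internally, and it skips stop word removal entirely; the toy_ names and the tiny dataset are assumptions for illustration only.

```r
# Toy, already-stemmed answers: one string per participant (made-up data)
answers <- c("doctor , babysit",
             "crimin , doctor , shoot , medicin",
             "health")

# Replace punctuation with spaces, then split each answer on whitespace
no_punct <- gsub("[[:punct:]]", " ", answers)
words <- strsplit(trimws(no_punct), "\\s+")

# Long format: one row per (participant, word) occurrence
long <- data.frame(Doc  = rep(seq_along(words), lengths(words)),
                   Term = unlist(words))

# Cross-tabulate into a terms-by-documents matrix of counts
toy_TDM <- unclass(table(Terms = long$Term, Docs = long$Doc))

# Same frequency table idea as in the post: total count per term, plus the
# count as a percentage of the number of participants
toy_freq <- data.frame(Word = rownames(toy_TDM),
                       Freq = rowSums(toy_TDM),
                       row.names = NULL)
toy_freq$percent <- toy_freq$Freq / length(answers) * 100
```

Printing toy_TDM shows the same layout as harm_TDM: terms down the side, participants across the top, and counts in the cells.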
Now the tricky part was to combine this data back with my other data frame that included participant information, including their Moral Foundations Questionnaire scores:

    > harm_words = harm_freq$Word[harm_freq$percent >=1]
    > head(harm_words)
    [1] "doctor"  "crimin"  "medicin" "health"  "mental"  "physic"

First, I created a vector of harm words that were mentioned at least 1% of the time. I use the transpose function t() to flip the dataset from rows as words to columns as words, to maintain “tidy-ish” data (i.e., each participant is their own row). Then I subset the dataset to only be my top words:

    > harm_TDM = as.data.frame(t(harm_TDM))
    > harm_TDM = harm_TDM[ , harm_words]
    > harm_TDM[1:6, 1:6]
      doctor crimin medicin health mental physic
    1      1      0       0      0      0      0
    2      1      1       1      0      0      0
    3      0      0       0      1      0      0
    4      0      0       0      0      1      1
    5      0      0       0      0      0      0
    6      0      0       0      0      0      0

Now, we can cbind our harm dataset with the other relevant columns for harm.

    > harm_final = cbind(noout1[ , c("ResponseId", "Q15_1", "Q23", "harmMFQ")],
    +                    harm_TDM)
    > harm_final[1:6, 1:6]
             ResponseId Q15_1         Q23 harmMFQ doctor crimin
    1 R_2BkYH8gEtZMEQnG     8    Democrat      18      1      0
    2 R_qCTluTnJCgGFqXT     6    Democrat      18      1      1
    3 R_11hglRVpaSclG0K     5  Republican      13      0      0
    4 R_3kMsBrEjwDtu5iJ     6 Independent      16      0      0
    5 R_swkbG8889YEOxoZ     3  Republican      14      0      0
    6 R_s682tzsz2YIkwJX    10    Democrat      17      0      0

So, now you too can create participant term-document matrices! In later posts, I’ll show you how we are going to use this information to create an updated dictionary and examine if that dictionary relates to the Moral Foundations Questionnaire. This task will involve some correlations, but also a multi-trait multi-method analysis using lavaan, so stay tuned if you are interested in structural equation modeling.

### New Publication – Detect Low Quality Data

My coauthor John Scofield and I just had a publication accepted at Behavior Research Methods – you can check out the publication preprint at OSF.
We threw together a website for the paper that summarizes everything we found, as well as puts all the materials together in one place – check it out. We created a really nice R function to help you detect low quality data, which you can find on GitHub, and I even made a video that explains all the parts to the function at YouTube. If you aren’t an R person, you can use our Shiny App, download the code, and watch the YouTube video that explains everything to you. Enjoy!

### Citations in R Markdown + Papaja

Heyo! I wanted to write a post about some of the quirky things I’ve found with writing manuscripts in R Markdown, as well as provide a solution to a problem that someone else might be having.

Update: The csl file I describe below is a specially formatted one, which was shared with me. You can download it from GitHub to try the suggestions below.

Update 2: Turns out, potentially, the suggestions from the manual are not working correctly, as Frederik has checked it out and opened an issue on GitHub. I’ll write a new post when there are updates!

First, let me tell you how much I love Frederik Aust’s papaja package for R. I had been trying to integrate open science and transparency in our lab, which was helped by the switch to R to track what we were doing in our data analysis. I heard about papaja through a former student, and I jumped in head first. I know it’s helped us think a LOT about reproducibility and replication, as we want people to be able to track what we did and avoid p-hacking in our papers. Having a workflow that is integrated throughout the manuscript really forces you to think about how you are presenting your data, and knowing that others can view it especially forces you to be clear about what you did. We’ve fully embraced working transparently through Open Science Framework integration, much of our work is on GitHub, and we are writing manuscripts with papaja to make it more obvious what is what.
Before doing that, I had started learning markdown, and although I’ve been using it for a bit now, I still feel like a noob. Mix LaTeX in there, and even more so. Thankfully, I have some very awesome Twitter friends that help me when I get stuck in trying to do something … like trying to stick a % symbol in a column name for a table. Whew.

One thing I wish were a little bit different is citations. Currently, papaja uses pandoc-citeproc to create the text referencing for knitting to PDF or Word. The problem with this is that any time you have the same author last names (like Erin Buchanan and Tom Buchanan), you automatically get E. Buchanan and T. Buchanan in the in-text referencing. That is APA style, but reviewers and the like do not like it. Real APA != Used APA.

The other issue stems from the fact that you will get the first initials, even if the other author name match is in second or third place. Therefore, if I cite myself and cite Tom but he only appears as second author, I will still get E. Buchanan in the in-text citation. That’s probably also a correct interpretation of APA but ain’t worth fighting reviewers over. Additionally, the absolute name matching often forces us to fix bibtex files a lot over things like Buchanan, E. versus Buchanan, E.M. versus Buchanan, Erin etc. Many different permutations of one person’s name via differences in doi citations can be tedious to fix.

Therefore! I checked out the papaja manual – which is stellar – to see if there was some other way to do it. I also googled this, but really got stuck with the translation of LaTeX to markdown. The manual suggests you can do this:

    ---
    output:
      papaja::apa6_pdf:
        citation_package: biblatex
    ---

to pass the citations through a different processor. Great! I will try that.

    Latexmk: This is Latexmk, John Collins, 19 Jan. 2017, version: 4.52c.
    Latexmk: applying rule 'biber QWERTY'...
    Rule 'biber QWERTY': File changes, etc:
       Non-existent destination files:
          'QWERTY.bbl'
    ------------
    Run number 1 of rule 'biber QWERTY'
    ------------
    ------------
    Running 'biber "QWERTY"'
    ------------
    INFO - This is Biber 2.7
    INFO - Logfile is 'QWERTY.blg'
    ERROR - QWERTY.bcf is malformed, last biblatex run probably failed. Deleted QWERTY.bbl
    INFO - ERRORS: 1
    Latexmk: biber found malformed bcf file for 'QWERTY'.
      I'll ignore error, and delete any bbl file.
    Rule 'pdflatex': File changes, etc:
       Non-existent destination files:
          'QWERTY.pdf'
    ------------
    Run number 1 of rule 'pdflatex'
    ------------
    Biber error: [427] Utils.pm:180> ERROR - QWERTY.bcf is malformed, last biblatex run probably failed. Deleted QWERTY.bbl
    Latexmk: applying rule 'pdflatex'...
    ------------
    Running 'pdflatex -halt-on-error -interaction=batchmode -recorder "QWERTY.tex"'
    ------------
    This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) (preloaded format=pdflatex)
     restricted \write18 enabled.
    entering extended mode
    Latexmk: Non-existent bbl file 'QWERTY.bbl'
     No file QWERTY.bbl.
    === TeX engine is 'pdfTeX'
    Biber error: [427] Utils.pm:180> ERROR - QWERTY.bcf is malformed, last biblatex run probably failed. Deleted QWERTY.bbl
    Latexmk: Errors, so I did not complete making targets
    Collected error summary (may duplicate other messages):
      pdflatex: Command for 'pdflatex' gave return code 1
          Refer to 'QWERTY.log' for details
    Latexmk: Use the -f option to force complete processing,
     unless error was exceeding maximum runs of latex/pdflatex.
    ! LaTeX Error: Command \c@author already defined.
                   Or name \end... illegal, see p.192 of the manual.
    Error: Failed to compile QWERTY.tex. See QWERTY.log for more info.
    Execution halted

Balls. I searched this error for a while and found: 1) update LaTeX: check, 2) figure out why your bibtex was messed up: check … tried with only one reference and still crashed, and 3) other stuff I don’t remember.
When I tried a separate markdown, thinking the one that I had open was the problem, I got the actual citation codes, rather than the text:

> Researchers discovered that online data collection can be advantageous over laboratory and paper data collection, as it is often cheaper and more efficient (Ilieva2001;Schuldt1994;Reips2012)

I thought maybe it was my computer, so one of my coauthors tried it. Same as the first error. Maybe it’s a Mac thing? Another coauthor with a Mac got the second error. I’m sad to say that I don’t have an answer for either of these problems – from the looks of it, I’m following the guidelines suggested, but both problems pop up. I would love to hear if you know why.

Enter Julia! Julia helped find a workaround for the issue. In the head of your markdown file (note I used some … to shorten some of what papaja does for you automatically):

    ...
    bibliography : ["q_bib.bib"]
    ...
    output : papaja::apa6_pdf
    replace_ampersands: yes
    csl : apa6.csl
    ---

And then be sure to put the apa6.csl in the same folder as your markdown. Now, you can confuse people with all your Buchanans, Logans, Cohens, and Fritzes. Or, in our case, we can make Reviewer #2 happy and annoy the copy editor.

Note: I had to update papaja to get this solution to work, as replace_ampersands did not work the first time.

### MOTE – GitHub to R Ready!

Heyo! I have so much stuff backlogged to blog about – especially that we are working on fully integrating to OSF and putting up preprints of the cool work we are doing! But this blog post is reserved for HOW EXCITED I AM to announce that MOTE is ready to go to import into R. Run this code in your R:

    install.packages("devtools") ##only needed if you do not have it yet
    devtools::install_github("doomlab/MOTE")

Remember that curly quotes (“”) sometimes do not copy correctly into R; retype them as straight quotes if you get an error. Go nuts! Ask questions! Give feedback!

One thing I did not talk about in the video is a limitation of V in chi-square.
Due to the distribution of chi-square, V confidence intervals are only useful on smaller r x c combinations (like 2×2, 3×3). After you hit about 4 rows/columns, the distribution flattens out, and the calculated confidence interval is not around the V value. For example, an X2 of 14 with sample size 100, with four rows and columns gives you:

    v.chi.sq(x2 = 14, n = 100, r = 4, c = 4, a = .05)
    $v
    [1] 0.6480741

    $vlow
    [1] 0.1732051

    $vhigh
    [1] 0.3241347

    $n
    [1] 100

    $df
    [1] 9

    $x2
    [1] 14

    $p
    [1] 0.1223252

    Warning message:
    The size of the effect combined with the degrees of freedom is too small to determine a lower confidence limit for the ‘alpha.lower’ (or the (1/2)(1-‘conf.level’) symmetric) value specified (set to zero).

As you can see, this is a limitation of confidence intervals on chi-square. Also, I found more typos :|.
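If you want a quick sanity check on the df and p entries in that output, base R can reproduce them without MOTE. This is only a sketch that assumes the usual (rows - 1) × (columns - 1) degrees of freedom for a chi-square test of independence; it says nothing about how MOTE builds the interval around V.

```r
# Values taken from the example call above
x2     <- 14
n_rows <- 4
n_cols <- 4

# Degrees of freedom for an r x c table (assumed: the usual formula)
df <- (n_rows - 1) * (n_cols - 1)

# Upper-tail p value of the chi-square statistic
p <- pchisq(x2, df = df, lower.tail = FALSE)

df  # matches $df in the output
p   # matches $p in the output, up to printing precision
```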

Go check out GitHub:

https://github.com/doomlab/MOTE

Go check out the video on how to install and the history of MOTE: