Title: | R Tools for Text Matrices, Embeddings, and Networks |
---|---|
Description: | This is a collection of functions optimized for working with various kinds of text matrices. Focusing on the text matrix as the primary object - represented either as a base R dense matrix or a 'Matrix' package sparse matrix - allows for a consistent and intuitive interface that stays close to the underlying mathematical foundation of computational text analysis. In particular, the package includes functions for working with word embeddings, text networks, and document-term matrices. Methods developed in Stoltz and Taylor (2019) <doi:10.1007/s42001-019-00048-6>, Taylor and Stoltz (2020) <doi:10.1007/s42001-020-00075-8>, Taylor and Stoltz (2020) <doi:10.15195/v7.a23>, and Stoltz and Taylor (2021) <doi:10.1016/j.poetic.2021.101567>. |
Authors: | Dustin Stoltz [aut, cre] |
Maintainer: | Dustin Stoltz <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.1 |
Built: | 2025-03-04 06:05:34 UTC |
Source: | https://gitlab.com/culturalcartography/text2map |
A dataset containing juxtaposing pairs of English words
for 26 semantic relations. These anchors are used with
the get_anchors()
function, which can then be used with
the get_direction()
function. These have been collected
from previously published articles and should be used
as a starting point for defining a given relation in
a word embedding model.
anchor_lists
A data frame with 303 rows and 4 variables.
Variables:
add. words to be added (or the positive direction)
subtract. words to be subtracted (or the negative direction)
relation. the relation to be extracted, 26 relations available
domain. 6 broader categories within which each relation falls
CoCA, get_direction, get_centroid, get_anchors
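A brief sketch of inspecting the bundled anchor pairs directly (get_anchors() is the intended accessor):
# load the anchor lists and peek at the gender relation
data(anchor_lists, package = "text2map")
head(anchor_lists[anchor_lists$relation == "gender", ])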
Concept Mover's Distance classifies documents of any length along a continuous measure of engagement with a given concept of interest using word embeddings.
CMDist(dtm, cw = NULL, cv = NULL, wv, missing = "stop", scale = TRUE, sens_interval = FALSE, alpha = 1, n_iters = 20L, parallel = FALSE, threads = 2L, setup_timeout = 120L)
cmdist(dtm, cw = NULL, cv = NULL, wv, missing = "stop", scale = TRUE, sens_interval = FALSE, alpha = 1, n_iters = 20L, parallel = FALSE, threads = 2L, setup_timeout = 120L)
dtm |
Document-term matrix with words as columns. Works with DTMs produced by any popular text analysis package, or using the dtm_builder() function. |
cw |
Vector with concept word(s) (e.g., "space"). |
cv |
Concept vector(s) as output from get_direction(), get_centroid(), or get_regions(). |
wv |
Matrix of word embedding vectors (a.k.a. embedding model) with rows as words. |
missing |
Indicates what action to take if words are not in embeddings. If missing = "stop" (default), the function is stopped and an error message states which terms are missing. If missing = "remove", missing terms or rows with missing terms are removed. Missing terms will be printed as a message. |
scale |
Logical (default = TRUE), whether to standardize the output. |
sens_interval |
Logical (default = FALSE), whether to return sensitivity intervals, i.e. the upper and lower bound estimates for each CMD (see Value). |
alpha |
If sens_interval = TRUE, a number indicating the proportion of each document's length used when resampling documents to estimate the sensitivity intervals (default = 1). |
n_iters |
If sens_interval = TRUE, an integer indicating the number of resampled iterations used to estimate the sensitivity intervals (default = 20L). |
parallel |
Logical (default = FALSE), whether to parallelize the computation. |
threads |
If parallel = TRUE, an integer indicating the number of threads to use (default = 2L). |
setup_timeout |
If parallel = TRUE, the number of seconds to allow for setting up the parallel workers before timing out (default = 120L). |
CMDist()
requires three things: a (1) document-term matrix (DTM), a (2)
matrix of word embedding vectors, and (3) concept words or concept vectors.
The function uses word counts from the DTM and word similarities
from the cosine similarity of their respective word vectors in a
word embedding model. The "cost" of transporting all the words in a
document to a single vector or a few vectors (denoting a
concept of interest) is the measure of engagement, with higher costs
indicating less engagement. For intuitiveness the output of CMDist()
is inverted such that higher numbers will indicate more engagement
with a concept of interest.
The vector, or vectors, of the concept can be specified in several ways. The simplest involves selecting a single word from the word embeddings; the analyst can also specify the concept by indicating a few words. The algorithm then splits the overall flow between each concept word (roughly) depending on which word in the document is nearest. The words need not be in the DTM, but they must be in the word embeddings (the function will either stop or remove words not in the embeddings).
Instead of selecting a word already in the embedding space, the function can
also take a vector extracted from the embedding space in the form of a
centroid (which averages the vectors of several words), a direction (which
uses the offset of several juxtaposing words), or a region (which is built
by clustering words into $k$ regions). The get_centroid(), get_direction(),
and get_regions() functions will extract these.
Returns a data frame with the first column as document ids and each
subsequent column as the CMD engagement corresponding to each
concept word or concept vector. The upper and lower bound
estimates will follow each unique CMD if sens_interval = TRUE.
Dustin Stoltz and Marshall Taylor
Stoltz, Dustin S., and Marshall A. Taylor. (2019)
'Concept Mover's Distance' Journal of Computational
Social Science 2(2):293-313.
doi:10.1007/s42001-019-00048-6.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Integrating semantic
directions with concept mover's distance to measure binary concept
engagement.' Journal of Computational Social Science 1-12.
doi:10.1007/s42001-020-00075-8.
Taylor, Marshall A., and Dustin S. Stoltz.
(2020) 'Concept Class Analysis: A Method for Identifying Cultural
Schemas in Texts.' Sociological Science 7:544-569.
doi:10.15195/v7.a23.
CoCA, get_direction, get_centroid
# load example word embeddings
data(ft_wv_sample)
# load example text
data(jfk_speech)

# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)

# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

# example 1
cm.dists <- CMDist(dtm, cw = "space", wv = ft_wv_sample)

# example 2
space <- c("spacecraft", "rocket", "moon")
cen <- get_centroid(anchors = space, wv = ft_wv_sample)
cm.dists <- CMDist(dtm, cv = cen, wv = ft_wv_sample)
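A concept vector can also come from get_direction(); a minimal sketch below builds a gender direction from a single anchor pair and passes it to cv, assuming the same sample embeddings and DTM as above:
# example 3 (sketch): engagement with a semantic direction
gen <- data.frame(add = "woman", subtract = "man")
gen_dir <- get_direction(anchors = gen, wv = ft_wv_sample)
cm.dists <- CMDist(dtm, cv = gen_dir, wv = ft_wv_sample)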
CoCA outputs schematic classes derived from documents' engagement
with multiple bi-polar concepts (in a Likert-style fashion).
The function requires (1) a DTM of a corpus, which can be obtained using any
popular text analysis package or the dtm_builder() function, and (2)
semantic directions as output from get_direction(). CMDist() works under
the hood. Code is modified from the corclass package.
CoCA(dtm, wv = NULL, directions = NULL, filter_sig = TRUE, filter_value = 0.05, zero_action = c("drop", "ownclass"))
coca(dtm, wv = NULL, directions = NULL, filter_sig = TRUE, filter_value = 0.05, zero_action = c("drop", "ownclass"))
dtm |
Document-term matrix with words as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder() function. |
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as words. |
directions |
direction vectors output from get_direction() |
filter_sig |
logical (default = TRUE), sets 'insignificant' ties to 0 to decrease noise and increase stability |
filter_value |
Minimum significance cutoff. Absolute row correlations below this value will be set to 0 |
zero_action |
If 'drop', CCA drops rows with 0 variance from the analyses (default). If 'ownclass', the correlations between 0-variance rows and all other rows is set 0, and the correlations between all pairs of 0-var rows are set to 1 |
Returns a named list object of class CoCA. List elements include:
membership: document memberships
modules: schematic classes
cormat: correlation matrix
Dustin Stoltz and Marshall Taylor
Taylor, Marshall A., and Dustin S. Stoltz.
(2020) 'Concept Class Analysis: A Method for Identifying Cultural
Schemas in Texts.' Sociological Science 7:544-569.
doi:10.15195/v7.a23.
Boutyline, Andrei. 'Improving the measurement of shared cultural
schemas with correlational class analysis: Theory and method.'
Sociological Science 4.15 (2017): 353-393.
doi:10.15195/v4.a15
# load example word embeddings
data(ft_wv_sample)
# load example text
data(jfk_speech)

# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)

# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

# create semantic directions
gen <- data.frame(add = c("woman"), subtract = c("man"))
die <- data.frame(add = c("alive"), subtract = c("die"))

gen_dir <- get_direction(anchors = gen, wv = ft_wv_sample)
die_dir <- get_direction(anchors = die, wv = ft_wv_sample)

sem_dirs <- rbind(gen_dir, die_dir)

classes <- CoCA(
  dtm = dtm,
  wv = ft_wv_sample,
  directions = sem_dirs,
  filter_sig = TRUE,
  filter_value = 0.05,
  zero_action = "drop"
)

print(classes)
Given a document-term matrix or a document-similarity matrix, this function returns specified text network-based centrality measures. Currently, this includes degree, eigenvector, betweenness, and spanning.
doc_centrality(mat, method, alpha = 1L, two_mode = TRUE)
mat |
Document-term matrix with terms as columns or a document-similarity matrix with documents as rows and columns. |
method |
Character vector indicating centrality method, including "degree", "eigen", "span", and "between". |
alpha |
Number (default = 1) indicating the tuning parameter for weighted metrics. |
two_mode |
Logical (default = TRUE), indicating whether the input matrix is two-mode (i.e. a document-term matrix) or one-mode (i.e. a document-similarity matrix). |
If a document-term matrix is provided, the function obtains the one-mode
document-level projection to get the document-similarity matrix using
tcrossprod(). If a one-mode document-similarity matrix is provided, then
this step is skipped. This way document similarities may be obtained
using other methods, such as Word Mover's Distance (see doc_similarity).
The diagonal is ignored in all calculations.
Document centrality methods include:
degree: Opsahl's weighted degree centrality with tuning parameter "alpha"
between: vertex betweenness centrality using Brandes' method
eigen: eigenvector centrality using Freeman's method
span: Modified Burt's constraint following Stoltz and Taylor's method, uses a tuning parameter "alpha" and the output is scaled.
A data frame with two columns
Dustin Stoltz
Brandes, Ulrik
(2000) 'A faster algorithm for betweenness centrality'
Journal of Mathematical Sociology. 25(2):163-177
doi:10.1080/0022250X.2001.9990249.
Opsahl, Tore, et al.
(2010) 'Node centrality in weighted networks: Generalizing degree
and shortest paths.' Social Networks. 32(3):245-251
doi:10.1016/j.socnet.2010.03.006
Stoltz, Dustin; Taylor, Marshall
(2019) 'Textual Spanning: Finding Discursive Holes in Text Networks'
Socius. doi:10.1177/2378023119827674
# load example text
data(jfk_speech)

# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)

# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

ddeg <- doc_centrality(dtm, method = "degree")
deig <- doc_centrality(dtm, method = "eigen")
dbet <- doc_centrality(dtm, method = "between")
dspa <- doc_centrality(dtm, method = "span")

# with a document-similarity matrix (dsm)
dsm <- doc_similarity(dtm, method = "cosine")
ddeg <- doc_centrality(dsm, method = "degree", two_mode = FALSE)
Given a document-term matrix (DTM) this function returns the similarities between documents using a specified method (see details). The result is a square document-by-document similarity matrix (DSM), equivalent to a weighted adjacency matrix in network analysis.
doc_similarity(x, y = NULL, method, wv = NULL)
x |
Document-term matrix with terms as columns. |
y |
Optional second document-term matrix (default = NULL). If provided, the documents (rows) of x are compared with the documents (rows) of y. |
method |
Character vector indicating similarity method, including "projection", "cosine", "jaccard", "wmd", and "centroid" (see Details). |
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as words. Required for "wmd" and "centroid" similarities. |
Document similarity methods include:
projection: finds the one-mode projection matrix from the two-mode DTM using tcrossprod(), which measures the shared vocabulary overlap
cosine: compares row vectors using cosine similarity
jaccard: compares proportion of common words to unique words in both documents
wmd: word mover's distance to compare documents (requires word embedding vectors), using linear-complexity relaxed word mover's distance
centroid: represents each document as a centroid of their respective vocabulary, then uses cosine similarity to compare centroid vectors (requires word embedding vectors)
Dustin Stoltz
# load example word embeddings
data(ft_wv_sample)
# load example text
data(jfk_speech)

# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)

# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

dsm_prj <- doc_similarity(dtm, method = "projection")
dsm_cos <- doc_similarity(dtm, method = "cosine")
dsm_wmd <- doc_similarity(dtm, method = "wmd", wv = ft_wv_sample)
dsm_cen <- doc_similarity(dtm, method = "centroid", wv = ft_wv_sample)
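The Details also list a "jaccard" method; a minimal sketch, assuming it is accepted by the method argument like the others:
# jaccard similarity of shared vs. unique vocabulary (sketch)
dsm_jac <- doc_similarity(dtm, method = "jaccard")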
A streamlined function to take raw texts from a column of a data.frame and produce a sparse Document-Term Matrix (of generic class "dgCMatrix").
dtm_builder( data, text, doc_id = NULL, vocab = NULL, chunk = NULL, dense = FALSE, omit_empty = FALSE )
data |
Data.frame with column of texts and column of document ids |
text |
Name of the column with documents' text |
doc_id |
Name of the column with documents' unique ids. |
vocab |
Default is NULL. If a character vector of terms is provided, the columns of the DTM will be restricted to, and ordered by, this vocabulary (see details). |
chunk |
Default is NULL. If an integer is provided, the corpus is divided into new documents of that many terms each (see details). |
dense |
The default (FALSE) returns a sparse matrix of class "dgCMatrix"; if TRUE, a dense base R matrix is returned. |
omit_empty |
Logical (default = FALSE), whether to omit documents that are empty (i.e. contain no terms). |
The function is fast because it has few bells and whistles:
No weighting schemes other than raw counts
Tokenizes by the fixed, single whitespace
Only tokenizes unigrams. No bigrams, trigrams, etc...
Columns are in the order unique terms are discovered
No preprocessing during building
Outputs a basic sparse Matrix or dense matrix
Weighting or stopping terms can be done efficiently after the fact with
simple matrix operations, rather than achieved implicitly within the
function itself. For example, using the dtm_stopper()
function.
Prior to creating the DTM, texts should have whitespace trimmed and, if
desired, punctuation removed and terms lowercased.
Like tidytext's DTM functions, dtm_builder() is optimized for use
in a pipeline, but unlike tidytext, it does not build an intermediary
triplet list, so dtm_builder() is faster and far more memory efficient.
The function can also chunk the corpus into documents of a given length
(default is NULL). If the integer provided is 200L, this will divide
the corpus into new documents with 200 terms (with the final document
likely including fewer than 200). If the total number of terms in the
corpus is less than or equal to the chunk integer, this will produce
a DTM with one document (most will probably not want this).
If the vocabulary is already known, or standardizing vocabulary across
several DTMs is desired, a list of terms can be provided to the vocab
argument. Columns of the DTM will be in the order of the list of terms.
returns a document-term matrix of class "dgCMatrix" or class "matrix"
Dustin Stoltz
library(dplyr)

my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)

## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))

# example 1 with R 4.1 pipe
dtm <- my_corpus |>
  dtm_builder(clean_text, line_id)

# example 2 without pipe
dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)

# example 3 with dplyr pipe and mutate
dtm <- my_corpus %>%
  mutate(
    clean_text = gsub("'", "", text),
    clean_text = tolower(clean_text)
  ) %>%
  dtm_builder(clean_text, line_id)

# example 4 with dplyr and chunk of 3 terms
dtm <- my_corpus %>%
  dtm_builder(clean_text, line_id, chunk = 3L)

# example 5 with user defined vocabulary
my.vocab <- c("wonderful", "world", "haiku", "think")
dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id,
  vocab = my.vocab
)
Converts a DTM into a data frame with three columns:
documents, terms, frequency. Each row is a unique
document-by-term frequency. This is akin to the reshape2
package's melt() function, but works on a sparse matrix.
The resulting data frame is also equivalent to the
tidytext triplet tibble.
dtm_melter(dtm)
dtm |
Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or using the dtm_builder() function. |
returns data frame with three columns: doc_id, term, freq
Dustin Stoltz
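A minimal sketch, reusing the JFK speech DTM from the earlier examples:
# melt a DTM into a triplet data frame (doc_id, term, freq)
data(jfk_speech)
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

dtm_long <- dtm_melter(dtm)
head(dtm_long)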
Takes any DTM and randomly resamples from each row, creating a new DTM.
dtm_resampler(dtm, alpha = NULL, n = NULL)
dtm |
Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder() function. |
alpha |
Number indicating proportion of document lengths, e.g., alpha = 0.50 returns a resampled DTM where each document has half the tokens of the original (see details). |
n |
Integer indicating the length of documents to be returned, e.g., n = 2000L resamples each document until it is 2000 tokens long (see details). |
Using the row counts as probabilities, each document's tokens are resampled with replacement up to a certain proportion of the row count (set by alpha). This function can be used with iteration to "bootstrap" a DTM without returning to the raw text. It does not iterate, however, so operations can be performed on one DTM at a time without storing multiple DTMs in memory.
If alpha is less than 1, then a proportion of each document's length is
returned. For example, alpha = 0.50 will return a resampled DTM where each
row has half the tokens of the original DTM. If alpha = 2, then each row in
the resampled DTM will have twice the number of tokens of the original DTM.
If an integer is provided to n, then all documents will be resampled to that
length. For example, n = 2000L will resample each document until they are
2000 tokens long – meaning those shorter than 2000 will be increased in
length, while those longer than 2000 will be decreased in length. alpha
and n should not be specified at the same time.
returns a document-term matrix of class "dgCMatrix"
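A brief sketch of resampling with either alpha or n, using the JFK speech DTM from the earlier examples:
data(jfk_speech)
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

# resample each document to half its original length
dtm_half <- dtm_resampler(dtm, alpha = 0.5)

# resample each document to exactly 10 tokens
dtm_ten <- dtm_resampler(dtm, n = 10L)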
dtm_stats()
provides a summary of corpus-level statistics
using any document-term matrix. These include (1) basic information
on size (total documents, total unique terms, total tokens),
(2) lexical richness, (3) distribution information,
(4) central tendency, and (5) character-level information.
dtm_stats( dtm, richness = TRUE, distribution = TRUE, central = TRUE, character = TRUE, simplify = FALSE )
dtm |
Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder() function. |
richness |
Logical (default = TRUE), whether to include statistics about lexical richness, i.e. terms that occur once, twice, and three times (hapax, dis, tris), and the total type-token ratio. |
distribution |
Logical (default = TRUE), whether to include statistics about the distribution, i.e. min, max st. dev, skewness, kurtosis. |
central |
Logical (default = TRUE), whether to include statistics about the central tendencies i.e. mean and median for types and tokens. |
character |
Logical (default = TRUE), whether to include statistics about the character lengths of terms, i.e. min, max, mean |
simplify |
Logical (default = FALSE), whether to return statistics as a data frame where each statistic is a column. Default returns a list of small data frames. |
A list of one to five data frames with summary statistics (if
simplify=FALSE
), otherwise a single data frame where each
statistic is a column.
Dustin Stoltz
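A minimal sketch, again using the JFK speech DTM; simplify = TRUE returns a single data frame instead of a list:
data(jfk_speech)
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

# list of small data frames with corpus-level statistics
dtm_stats(dtm)

# single data frame, one statistic per column
dtm_stats(dtm, simplify = TRUE)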
dtm_stopper
will "stop" terms from the analysis by removing columns in a
DTM based on stop rules. Rules include matching terms in a precompiled or
custom list, terms meeting an upper or lower document frequency threshold,
or terms meeting an upper or lower term frequency threshold.
dtm_stopper( dtm, stop_list = NULL, stop_termfreq = NULL, stop_termrank = NULL, stop_termprop = NULL, stop_docfreq = NULL, stop_docprop = NULL, stop_hapax = FALSE, stop_null = FALSE, omit_empty = FALSE, dense = FALSE, ignore_case = TRUE )
dtm |
Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder() function. |
stop_list |
Vector of terms to be stopped, from a precompiled stoplist or a custom list, such as the output of get_stoplist(). |
stop_termfreq |
Vector of two numbers indicating the lower and upper term-frequency thresholds for exclusion (see details). |
stop_termrank |
Single integer indicating upper term rank threshold for exclusion (see details). |
stop_termprop |
Vector of two numbers indicating the lower and upper term-proportion thresholds for exclusion (see details). |
stop_docfreq |
Vector of two numbers indicating the lower and upper document-frequency thresholds for exclusion (see details). |
stop_docprop |
Vector of two numbers indicating the lower and upper document-proportion thresholds for exclusion (see details). |
stop_hapax |
Logical (default = FALSE) indicating whether to remove terms occurring one time (or zero times), a.k.a. hapax legomena |
stop_null |
Logical (default = FALSE) indicating whether to remove terms that occur zero times in the DTM. |
omit_empty |
Logical (default = FALSE) indicating whether to omit rows that are empty after stopping any terms. |
dense |
The default (FALSE) returns a sparse matrix of class "dgCMatrix"; if TRUE, a dense base R matrix is returned. |
ignore_case |
Logical (default = TRUE) indicating whether to ignore capitalization. |
Stopping terms by removing their respective columns in the DTM is
significantly more efficient than searching raw text with string matching
and deletion rules. Behind the scenes, the function relies on
the fastmatch
package to quickly match/not-match terms.
The stop_list argument takes a vector of terms which are matched and
removed from the DTM. If ignore_case = TRUE (the default) then word
case will be ignored.
The stop_termfreq argument provides rules based on a term's occurrences
in the DTM as a whole – regardless of its within-document frequency. If
real numbers between 0 and 1 are provided, then terms will be removed by
corpus proportion. For example, with c(0.01, 0.99), terms that are either
below 1% of the total tokens or above 99% of the total tokens will be
removed. If integers are provided, then terms will be removed by total
count. For example, with c(100, 9000), terms occurring fewer than 100 or
more than 9000 times in the corpus will be removed. This also means that
if c(0, 1) is provided, then the function will only keep terms occurring once.
The stop_termrank argument provides the upper threshold for a term's rank
in the corpus. For example, 5L will remove the five most frequent terms.
The stop_docfreq argument provides rules based on a term's document
frequency – i.e. the number of documents within which it occurs, regardless
of how many times it occurs. If real numbers between 0 and 1 are provided,
then terms will be removed by corpus proportion. For example, with
c(0.01, 0.99), terms in more than 99% of all documents or in less than 1% of
all documents will be removed. If integers are provided, for example
c(100, 9000), then terms occurring in fewer than 100 documents or more than
9000 documents will be removed. This means that if c(0, 1) is provided, then
the function will only keep terms occurring in exactly one document, and
remove terms in more than one.
The stop_hapax argument is a shortcut for removing terms occurring just one
time in the corpus – called hapax legomena. Typically, a sizeable portion
of the corpus tends to be hapax terms, and removing them is a quick solution
to reducing the dimensions of a DTM. The DTM must contain frequency counts
(not relative frequencies).
The stop_null
argument removes terms that do not occur at all.
In other words, there is a column for the term, but the entire column
is zero. This can occur for a variety of reasons, such as starting with
a predefined vocabulary (e.g., using dtm_builder's vocab
argument) or
through some cleaning processes.
The omit_empty argument will remove documents that are empty after stopping terms.
returns a document-term matrix of class "dgCMatrix"
Dustin Stoltz
# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)

## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))

dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)

## example 1 with R 4.1 pipe
dtm_st <- dtm |>
  dtm_stopper(stop_list = c("world", "babies"))

## example 2 without pipe
dtm_st <- dtm_stopper(dtm, stop_list = c("world", "babies"))

## example 3 precompiled stoplist
dtm_st <- dtm_stopper(dtm, stop_list = get_stoplist("snowball2014"))

## example 4, stop top 2
dtm_st <- dtm_stopper(dtm, stop_termrank = 2L)

## example 5, stop docfreq
dtm_st <- dtm_stopper(dtm, stop_docfreq = c(2, 5))
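Term-frequency thresholds work the same way; a brief sketch using raw counts, following the rules in the Details (terms occurring fewer than 2 or more than 5 times in the corpus are removed):
## example 6, stop termfreq by raw counts
dtm_st <- dtm_stopper(dtm, stop_termfreq = c(2L, 5L))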
"Project" each word in a word embedding matrix of dimension along a
vector of
dimensions, extracted from the same embedding space.
The vector can be a single word, or a concept vector obtained from
get_centroid()
, get_direction()
, or get_regions()
.
find_projection(wv, vec)
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as words. |
vec |
Vector extracted from the embeddings |
All the vectors in the matrix, $A$, are projected onto the vector, $v$,
to find the projection matrix, $P$, defined as:
$$P = \frac{A v}{v^\top v} v^\top$$
A new word embedding matrix, each row of which is parallel to the vector $v$.
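A minimal sketch using the bundled sample embeddings, projecting every word vector along a single concept vector (here, a centroid built from one anchor term):
# load example word embeddings
data(ft_wv_sample)
cen <- get_centroid(anchors = "moon", wv = ft_wv_sample)

# each row of wv_proj is parallel to the concept vector
wv_proj <- find_projection(ft_wv_sample, vec = cen)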
"Reject" each word in a word embedding matrix of dimension
from a vector of
dimensions, extracted from the same
embedding space. The vector can be a single word, or a concept
vector obtained from
get_centroid()
, get_direction()
,
or get_regions()
.
find_rejection(wv, vec)
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as words. |
vec |
Vector extracted from the embeddings |
A new word embedding matrix, each row of which is rejected from the vector.
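Similarly, a minimal sketch that removes the component along the same concept vector from every word vector:
# load example word embeddings
data(ft_wv_sample)
cen <- get_centroid(anchors = "moon", wv = ft_wv_sample)

# each row of wv_rej has the concept-vector component removed
wv_rej <- find_rejection(ft_wv_sample, vec = cen)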
Given a matrix, $A$, of word embedding vectors (source) with
terms as rows, this function finds a transformed matrix following a
specified operation. These include: centering (i.e.
translation) and normalization (i.e. scaling). In the first, $A$ is
centered by subtracting column means. In the second, $A$ is
normalized by the L2 norm. Both have been found to improve
word embedding representations. The function also finds a transformed
matrix that approximately aligns $A$ with another matrix,
$B$, of word embedding vectors (reference), using Procrustes
transformation (see details). Finally, given a term-co-occurrence matrix
built on a local corpus, the function can "retrofit" pretrained
embeddings to better match the local corpus.
find_transformation( wv, ref = NULL, method = c("align", "norm", "center", "retrofit") )
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as terms (the source matrix to be transformed). |
ref |
If method = "align", a matrix of word embedding vectors to serve as the reference (target) space (default = NULL). |
method |
Character vector indicating the method to use for the transformation. Current methods include: "align", "norm", "center", and "retrofit" – see details. |
Aligning a source matrix of word embedding vectors, $A$, to a
reference matrix, $B$, has primarily been used as a post-processing step
for embeddings trained on longitudinal corpora for diachronic analysis
or for cross-lingual embeddings. Aligning preserves internal (cosine)
distances, while orienting the source embeddings to minimize the sum of
squared distances (and is therefore a least squares problem).
Alignment is accomplished with the following steps:
translation: centering by column means
scaling: scaling (normalizing) by the L2 norm
rotation/reflection: rotating and reflecting to minimize the sum of squared differences, using singular value decomposition
Alignment is asymmetrical, and only outputs the transformed source matrix,
$A$. Therefore, it is typically recommended to align $A$ to $B$,
and then $B$ to $A$. However, simply centering and norming
after alignment may be sufficient.
A new word embedding matrix, transformed using the specified method.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. (2018).
'A robust self-learning method for fully unsupervised
cross-lingual mappings of word embeddings.' In Proceedings
of the 56th Annual Meeting of the Association for
Computational Linguistics. 789-798
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019.
'An effective approach to unsupervised machine translation.'
In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics. 194-203
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. (2018).
'Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.'
https://arxiv.org/abs/1605.09096v6.
Lin, Zefeng, Xiaojun Wan, and Zongming Guo. (2019).
'Learning Diachronic Word Embeddings with Iterative Stable
Information Alignment.' Natural Language Processing and
Chinese Computing. 749-60. doi:10.1007/978-3-030-32233-5_58.
Schlechtweg et al. (2019). 'A Wind of Change: Detecting and
Evaluating Lexical Semantic Change across Times and Domains.'
https://arxiv.org/abs/1906.02979v1.
Shoemark et al. (2019). 'Room to Glo: A Systematic Comparison
of Semantic Change Detection Approaches with Word Embeddings.'
Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing. 66-76. doi:10.18653/v1/D19-1007
Borg and Groenen. (1997). Modern Multidimensional Scaling.
New York: Springer. 340-342
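A brief sketch of the single-matrix transformations, using the bundled sample embeddings (the "align" and "retrofit" methods additionally need a reference matrix or a local term-co-occurrence matrix passed to ref):
# load example word embeddings
data(ft_wv_sample)

# normalize each vector by the L2 norm
wv_norm <- find_transformation(ft_wv_sample, method = "norm")

# center by subtracting column means
wv_cent <- find_transformation(ft_wv_sample, method = "center")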
These are a sample of the English fastText embeddings
including 770 words matching those used in the jfk_speech data.
These are intended to be used for example code.
ft_wv_sample
A matrix of 770 rows and 300 columns
Produces a data.frame of juxtaposed word pairs used to extract
a semantic direction from word embeddings. Can be used as input
to get_direction().
get_anchors(relation)
relation |
String indicating a semantic relation, 26 relations are available in the dataset (see details). |
Sets of juxtaposed "anchor" pairs are adapted from published work and associated with a particular semantic relation. These should be used as a starting point, not as a "ground truth."
Available relations include:
activity
affluence
age
attractiveness
borders
concreteness
cultivation
dominance
education
gender
government
purity
safety
sexuality
skills
status
valence
whiteness
returns a tibble with two columns
Dustin Stoltz
gen <- get_anchors(relation = "gender")
gen <- get_anchors(relation = "gender")
The function outputs an averaged vector from a set of anchor terms' word
vectors. This average is roughly equivalent to the intersection of the
contexts in which each word is used. This semantic centroid can be used
for a variety of ends, and specifically as input to CMDist().
get_centroid() requires a list of terms, string of terms, data.frame,
or matrix. In the latter two cases, the first column will be used. The
vectors are aggregated using the simple average. Terms can be repeated,
and are therefore "weighted" by their counts.
get_centroid(anchors, wv, missing = "stop")
get_centroid(anchors, wv, missing = "stop")
anchors |
List of terms to be averaged |
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as words. |
missing |
what action to take if terms are not in embeddings. If action = "stop" (default), the function is stopped and an error messages states which terms are missing. If action = "remove", missing terms or rows with missing terms are removed. Missing terms will be printed as a message. |
returns a one row matrix
Dustin Stoltz
# load example word embeddings
data(ft_wv_sample)

space1 <- c("spacecraft", "rocket", "moon")
cen1 <- get_centroid(anchors = space1, wv = ft_wv_sample)

space2 <- c("spacecraft rocket moon")
cen2 <- get_centroid(anchors = space2, wv = ft_wv_sample)

identical(cen1, cen2)
get_direction() outputs a vector corresponding to one pole of a
"semantic direction" built from sets of antonyms or juxtaposed terms.
The output can be used as an input to CMDist() and CoCA(). Anchors
must be a two-column data.frame or a list of length == 2.
get_direction(anchors, wv, method = "paired", missing = "stop", n_dirs = 1L)
get_direction(anchors, wv, method = "paired", missing = "stop", n_dirs = 1L)
anchors |
A data frame or list of juxtaposed 'anchor' terms |
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as terms. |
method |
Indicates the method used to generate vector offset. Default is 'paired'. See details. |
missing |
what action to take if terms are not in embeddings. If action = "stop" (default), the function is stopped and an error messages states which terms are missing. If action = "remove", missing terms or rows with missing terms are removed. Missing terms will be printed as a message. |
n_dirs |
If method = "PCA", an integer indicating the number of principal-component directions to return (default = 1L). |
Semantic directions can be estimated using a few methods:
'paired' (default): each individual term is subtracted from exactly one other paired term. There must be the same number of terms for each side of the direction (although one word may be used more than once).
'pooled': terms corresponding to one side of a direction are first averaged, and then these averaged vectors are subtracted. A different number of terms can be used for each side of the direction.
'L2': the vector is calculated the same as with 'pooled' but is then divided by the L2 'Euclidean' norm.
'PCA': vector offsets are calculated for each pair of terms, as with 'paired', and if n_dirs = 1L (the default) then the direction is the first principal component. Users can return more than one direction by increasing the n_dirs parameter.
returns a one row matrix
Dustin Stoltz
Bolukbasi, T., Chang, K. W., Zou, J., Saligrama, V., and Kalai, A. (2016).
Quantifying and reducing stereotypes in word embeddings. arXiv preprint
https://arxiv.org/abs/1606.06121v1.
Bolukbasi, Tolga, Kai-Wei Chang, James Zou, Venkatesh Saligrama,
Adam Kalai (2016). 'Man Is to Computer Programmer as Woman Is to Homemaker?
Debiasing Word Embeddings.' Proceedings of the 30th International Conference
on Neural Information Processing Systems. 4356-4364.
https://dl.acm.org/doi/10.5555/3157382.3157584.
Taylor, Marshall A., and Dustin S. Stoltz. (2020)
'Concept Class Analysis: A Method for Identifying Cultural
Schemas in Texts.' Sociological Science 7:544-569.
doi:10.15195/v7.a23.
Taylor, Marshall A., and Dustin S. Stoltz. (2020) 'Integrating semantic
directions with concept mover's distance to measure binary concept
engagement.' Journal of Computational Social Science 1-12.
doi:10.1007/s42001-020-00075-8.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. (2019). 'The geometry
of culture: Analyzing the meanings of class through word embeddings.'
American Sociological Review 84(5):905-949.
doi:10.1177/0003122419877135.
Arseniev-Koehler, Alina, and Jacob G. Foster. (2020). 'Machine learning
as a model for cultural learning: Teaching an algorithm what it means to
be fat.' arXiv preprint https://arxiv.org/abs/2003.12133v2.
# load example word embeddings
data(ft_wv_sample)

# create anchor list
gen <- data.frame(
  add = c("woman"),
  subtract = c("man")
)

dir <- get_direction(anchors = gen, wv = ft_wv_sample)

dir <- get_direction(
  anchors = gen,
  wv = ft_wv_sample,
  method = "PCA",
  n_dirs = 1L
)
Given a set of word embeddings of $d$ dimensions and a vocabulary of $v$
terms, get_regions() finds $k$ semantic regions in $d$ dimensions.
This, in effect, learns latent topics from an embedding space (a.k.a.
topic modeling), which are directly comparable to both terms (with
cosine similarity) and documents (with Concept Mover's Distance using
CMDist()).
get_regions(wv, k_regions = 5L, max_iter = 20L, seed = 0)
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as words. |
k_regions |
Integer indicating the k number of regions to return |
max_iter |
Integer indicating the maximum number of iterations before k-means terminates. |
seed |
Integer indicating a random seed. Default is 0, which calls 'std::time(NULL)'. |
To group words into more encompassing "semantic regions" we use $k$-means
clustering. We choose $k$-means primarily for its ubiquity and the wide
range of available diagnostic tools for $k$-means clustering.
A word embedding matrix of $d$ dimensions and a $v$ term vocabulary is
"clustered" into $k$ semantic regions, which have $d$ dimensions.
Each region is represented by a single point defined by a $d$-dimensional
vector. The process discretely assigns all word vectors to a given region
so as to minimize some error function; however, as the resulting regions
are in the same dimensions as the word embeddings, we can measure each
term's similarity to each region. This, in effect, is a mixed-membership
topic model, similar to topic modeling by Latent Dirichlet Allocation.
We use the KMeans_arma function from the ClusterR package, which
uses the Armadillo library.
returns a matrix of class "dgCMatrix" with k rows and d dimensions
Dustin Stoltz
Butnaru, Andrei M., and Radu Tudor Ionescu. (2017)
'From image to text classification: A novel approach
based on clustering word embeddings.'
Procedia computer science. 112:1783-1792.
doi:10.1016/j.procs.2017.08.211.
Zhang, Yi, Jie Lu, Feng Liu, Qian Liu, Alan Porter,
Hongshu Chen, and Guangquan Zhang. (2018).
'Does Deep Learning Help Topic Extraction? A Kernel
K-Means Clustering Method with Word Embedding.'
Journal of Informetrics. 12(4):1099-1117.
doi:10.1016/j.joi.2018.09.004.
Arseniev-Koehler, Alina and Cochran, Susan D and
Mays, Vickie M and Chang, Kai-Wei and Foster,
Jacob Gates (2021) 'Integrating topic modeling
and word embedding to characterize violent deaths'
doi:10.31235/osf.io/nkyaq
# load example word embeddings
data(ft_wv_sample)

my.regions <- get_regions(
  wv = ft_wv_sample,
  k_regions = 10L,
  max_iter = 10L,
  seed = 01984
)
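Because the regions live in the same space as the word vectors, they can be passed to CMDist() as concept vectors; a rough sketch, continuing from the example above and rebuilding the JFK speech DTM:
# engagement of each document with each semantic region (sketch)
data(jfk_speech)
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)

cmd_regions <- CMDist(dtm, cv = my.regions, wv = ft_wv_sample)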
Provides access to 8 precompiled stoplists, including the most commonly used
stoplist from the Snowball stemming package ("snowball2014"), text2map's
tiny stoplist ("tiny2020"), and a few historically important stop lists. This
aims to be a transparent and well-documented collection of stoplists. Only
English language stoplists are included at the moment.
get_stoplist(source = "tiny2020", language = "en", tidy = FALSE)
get_stoplist(source = "tiny2020", language = "en", tidy = FALSE)
source |
Character indicating the stoplist source (default = "tiny2020"); see details for available stoplists. |
language |
Character (default = "en") indicating language of stopwords by ISO 639-1 code, currently only English is supported. |
tidy |
logical (default = FALSE), whether to return the stoplist as a tibble rather than a character vector. |
There is no such thing as a stopword! But, there are tons of
precompiled lists of words that someone thinks we should remove from
our texts. (See for example: https://github.com/igorbrigadir/stopwords)
One of the first stoplists is from C.J. van Rijsbergen's "Information
retrieval: theory and practice" (1979) and includes 250 words.
text2map's very own stoplist tiny2020 is a lean 34 words.
Below are stoplists available with get_stoplist:
"tiny2020": Tiny (2020) list of 33 words (Default)
"snowball2001": Snowball stemming package's (2001) list of 127 words
"snowball2014": Updated Snowball (2014) list of 175 words
"van1979": C. J. van Rijsbergen's (1979) list of 250 words
"fox1990": Christopher Fox's (1990) list of 421 words
"smart1993": Original SMART (1993) list of 570 words
"onix2000": ONIX (2000) list of 196 words
"nltk2001": Python's NLTK (2009) list of 179 words
The Snowball (2014) stoplist is likely the most commonly used; it is the
default in the stopwords package, which is used by the quanteda, tidytext,
and tokenizers packages, followed closely by the SMART (1993) stoplist,
the default in the tm package. The word counts for SMART (1993) and
ONIX (2000) are slightly different than in other places because of
duplicate words.
Character vector of words to be stopped; if tidy = TRUE, a tibble is returned.
Dustin Stoltz
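A minimal sketch pulling two of the precompiled lists:
# default tiny2020 list
stops_tiny <- get_stoplist()

# Snowball (2014) list as a character vector
stops_snow <- get_stoplist(source = "snowball2014")
head(stops_snow)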
This is a data frame for the text of JFK's Rice Speech "We choose to go to the moon." Each row is a 10 word string of the speech – roughly a sentence. This is intended to be used for example code.
jfk_speech
A data frame with 2 columns
Variables:
sentence_id. Order and unique ID for the sentence
sentence. The text of a sentence
Metadata related to Shakespeare's First Folio including the IDs to download the plays from Project Gutenberg, and a count of the number of deaths in each play (body count).
meta_shakespeare
A matrix of 37 rows and 8 columns
Variables:
short_title.
gutenberg_title.
gutenberg_id.
genre.
year.
body_count.
boas_problem_plays.
death.
perm_tester()
carries out Monte Carlo permutation tests for model
p-values from two-tailed, left-tailed, and/or right-tailed hypothesis
testing.
perm_tester(data, model, perm_var = NULL, strat_var = NULL, statistic, perm_n = 1000, alternative = "all", alpha = 0.05, seed = NULL)
data |
The dataframe from which the model is estimated. |
model |
The model which will be estimated and re-estimated. |
perm_var |
The variable in the model that will be permuted. Default is NULL. |
strat_var |
Categorical variable for within-stratum permutations. Defaults to NULL. |
statistic |
The name of the model statistic you want to "grab" after re-running the model with each permutation to compare to the original model statistic. |
perm_n |
The total number of permutations. Defaults to 1000. |
alternative |
The alternative hypothesis. Default is "all", which reports the two-tailed, left-tailed, and right-tailed p-values; "two.sided" (as in the examples) reports only the two-tailed test. |
alpha |
Alpha level for the hypothesis test. Defaults to 0.05. |
seed |
Optional seed for reproducibility of the p-value statistics. Defaults to NULL. |
perm_tester()
can be used to derive p-values under the randomization
model of inference. There are various reasons one might want to do this—
with text data, and observational data more generally, this might be
because the corpus/sample is not a random sample from a target population.
In such cases, population model p-values might not make much sense since
the asymptotically-derived standard errors from which they are constructed
themselves do not make sense. We might therefore want to make inferences
on the basis of whether or not randomness, as a data-generating mechanism,
might reasonably account for a statistic at least as extreme as the one
we observed. perm_tester()
works from this idea.
perm_tester()
works like this. First, the model (supplied the model
parameter) is run on the observed data. Second, we take some statistic of
interest, which we indicate with the statistic
parameter, and set it to
the side. Third, a variable, perm_var
, is permuted—meaning the observed
values for the rows of data
on perm_var
are randomly reshuffled. Fourth,
we estimate the model again, this time with the permuted perm_var
. Fifth,
we grab that same statistic
. We repeat steps two through
five a total of perm_n
times, each time tallying the number of times the
statistic
from the permutation-derived model is greater than or equal to
(for a right-tailed test), less-than or equal to (for a left-tailed test),
and/or has an absolute value greater than or equal to (for a two-tailed test)
the statistic
from the "real" model.
If we divide those tallies by the total number of permutations, then we
get randomization-based p-values. This is what perm_tester()
does. The
null hypothesis is that randomness could likely generate the statistic
that we observe. The alternative hypothesis is that randomness alone likely
can't account for the observed statistic.
We then reject the null hypothesis if the p-value is below a threshold indicated
with alpha
, which, as in population-based inference, is the probability
below which we are willing to reject the null hypothesis when it is actually
true. So if the p-value is below, say, alpha
= 0.05 and we're performing
a right-tailed test, then fewer than 5% of the statistics derived from the
permutation-based models are greater than or equal to our observed
statistic. We would then reject the null, as it is unlikely (based on our alpha
threshold), that randomness as a data-generating mechanism can account
for a test statistic at least as large the one we observed.
In most cases, analysts probably cannot expect to perform "exact" permutation
tests where every possible permutation is accounted for—i.e., where
perm_n
equals the total number of possible permutations. Instead, we
can take random samples of the "population" of permutations. perm_tester()
does this, and reports the standard errors and (1 - alpha
) confidence
intervals for the p-values.
perm_tester()
can also perform stratified permutation tests, where the observed
perm_var values are permuted within groups. This can be done by setting
the strat_var argument to the grouping variable.
Returns a data frame with the observed statistic (stat
), the
p-values (P_left
, for left-tailed, P_right
for right-tailed,
and/or
P_two
for two-tailed), and the standard errors and confidence
intervals for those p-values, respectively.
Marshall Taylor and Dustin Stoltz
Taylor, Marshall A. (2020)
'Visualization Strategies for Regression Estimates with Randomization
Inference' Stata Journal 20(2):309-335.
doi:10.1177/1536867X20930999.
Darlington, Richard B. and Andrew F. Hayes (2016)
Regression analysis and linear models: Concepts, applications, and implementation.
Guilford Publications.
Ernst, Michael D. (2004)
'Permutation Methods: A Basis for Exact Inference' Statistical Science
19(4):676-685.
doi:10.1214/088342304000000396.
Manly, Bryan F. J. (2007)
Randomization, Bootstrap and Monte Carlo Methods in Biology.
Chapman and Hall/CRC.
doi:10.1201/9781315273075.
data <- text2map::meta_shakespeare

model <- lm(body_count ~ boas_problem_plays + year + genre, data = data)

# without stratified permutations, two-sided test
out1 <- perm_tester(
  data = data,
  model = model,
  statistic = "coefficients",
  perm_n = 40,
  alternative = "two.sided",
  alpha = .01,
  seed = 8675309
)

# with stratified permutations, two-sided test
out2 <- perm_tester(
  data = data,
  model = model,
  strat_var = "boas_problem_plays",
  statistic = "coefficients",
  perm_n = 40,
  alternative = "two.sided",
  alpha = .01,
  seed = 8675309
)
Plot CoCA
## S3 method for class 'CoCA'
plot(x, module = NULL, cutoff = 0.05, repulse = 1.86, min = 0.15, max = 1, main = NULL, ...)
x |
CoCA object returned by CoCA(). |
module |
index for which module to plot (default = NULL) |
cutoff |
minimum absolute value of correlations to plot |
repulse |
repulse radius in the spring layout |
min |
edges with absolute weights under this value are not shown (default = 0.15) |
max |
highest weight to scale the edge widths to (default = 1) |
main |
title for plot (default = NULL) |
... |
Arguments to be passed to methods |
returns a qgraph object
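A brief sketch, assuming classes is the CoCA object created in the CoCA() examples above:
# plot the first schematic class
plot(classes, module = 1)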
Prints CoCA class information
## S3 method for class 'CoCA'
print(x, ...)
x |
CoCA object returned by CoCA(). |
... |
Arguments to be passed to methods |
prints a message indicating the classes and sizes
rancor_builder() generates a random corpus (rancor) based on user-defined
term probabilities and vocabulary. Users can set the number of
documents, as well as the mean, standard deviation, minimum, and maximum
document lengths (i.e., number of tokens) of the parent normal distribution
from which the document lengths are randomly sampled. The output is a single
document-term matrix. To produce multiple random corpora, use
rancors_builder() (note the plural). Term probabilities/vocabulary can
come from a user's own corpus, or a pre-compiled frequency list, such
as the one derived from the Google Books N-grams corpus.
rancor_builder(data, vocab, probs, n_docs = 100L, len_mean = 500, len_var = 10L, len_min = 20L, len_max = 1000L, seed = NULL)
data |
Data.frame containing vocabulary and probabilities |
vocab |
Name of the column containing vocabulary |
probs |
Name of the column containing probabilities |
n_docs |
Integer indicating the number of documents to be returned |
len_mean |
Integer indicating the mean of the document lengths in the parent normal sampling distribution |
len_var |
Integer indicating the standard deviation of the document lengths in the parent normal sampling distribution |
len_min |
Integer indicating the minimum of the document lengths in the parent normal sampling distribution |
len_max |
Integer indicating the maximum of the document lengths in the parent normal sampling distribution |
seed |
Optional seed for reproducibility |
Dustin Stoltz and Marshall Taylor
# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)

## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))

dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)

# use colSums to get term frequencies
df <- data.frame(
  terms = colnames(dtm),
  freqs = colSums(dtm)
)
# convert to probabilities
df$probs <- df$freqs / sum(df$freqs)

# create random DTM
rDTM <- df |>
  rancor_builder(terms, probs)
rancors_builder() generates multiple random corpora (rancors) based on
user-defined term probabilities and vocabulary. Users can set the number of
documents, as well as the mean, standard deviation, minimum, and maximum
document lengths (i.e., number of tokens) of the parent normal distribution
from which the document lengths are randomly sampled. The output is a list of
document-term matrices. To produce a single random corpus, use
rancor_builder() (note the singular).
rancors_builder( data, vocab, probs, n_cors, n_docs, len_mean, len_var, len_min, len_max, seed = NULL )
data |
Data.frame containing vocabulary and probabilities |
vocab |
Name of the column containing vocabulary |
probs |
Name of the column containing probabilities |
n_cors |
Integer indicating the number of corpora to build |
n_docs |
Integer(s) indicating the number of documents to be returned. If two numbers are provided, a number will be randomly sampled within the range for each corpus. |
len_mean |
Integer(s) indicating the mean of the document lengths in the parent normal sampling distribution. If two numbers are provided, a number will be randomly sampled within the range for each corpus. |
len_var |
Integer(s) indicating the standard deviation of the document lengths in the parent normal sampling distribution. If two numbers are provided, a number will be randomly sampled within the range for each corpus. |
len_min |
Integer(s) indicating the minimum of the document lengths in the parent normal sampling distribution. If two numbers are provided, a number will be randomly sampled within the range for each corpus. |
len_max |
Integer(s) indicating the maximum of the document lengths in the parent normal sampling distribution. If two numbers are provided, a number will be randomly sampled within the range for each corpus. |
seed |
Optional seed for reproducibility |
Dustin Stoltz and Marshall Taylor
# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)

## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))

dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)

# use colSums to get term frequencies
df <- data.frame(
  vocab = colnames(dtm),
  freqs = colSums(dtm)
)

# convert to probabilities
df$probs <- df$freqs / sum(df$freqs)

# create a list of random DTMs
ls_dtms <- df |>
  rancors_builder(vocab, probs,
    n_cors = 20,
    n_docs = 100,
    len_mean = c(50, 200),
    len_var = 5,
    len_min = 20,
    len_max = 1000,
    seed = 59801
  )
length(ls_dtms)
seq_builder() converts each document into a sequence of integers. First, each token in the vocabulary is mapped to an integer in a lookup dictionary. Next, documents are converted to sequences of integers, where each integer is the index of that token in the dictionary.
seq_builder( data, text, doc_id = NULL, vocab = NULL, maxlen = NULL, matrix = TRUE )
data |
Data.frame with column of texts and column of document ids |
text |
Name of the column with documents' text |
doc_id |
Name of the column with documents' unique ids. |
vocab |
Default is NULL, in which case every unique token in the corpus is used; otherwise, the list of words provided will be used as the vocabulary. |
maxlen |
Integer indicating the maximum document length. If NULL (default), the length of the longest document is used. |
matrix |
Logical (default = TRUE). If TRUE, returns a matrix of integer sequences with shorter documents padded with zeros; if FALSE, returns a list of integer sequences. |
Function will return a matrix of integer sequences by default. The columns will be the length of the longest document or maxlen, with shorter documents padded with zeros. The dictionary will be an attribute of the matrix, accessed with attr(seq, "dic"). If matrix = FALSE, the function will return a list of integer sequences. The vocabulary will either be each unique token in the corpus or the list of words provided to the vocab argument. This kind of text representation is used in tensorflow and keras.
returns a matrix or list
Dustin Stoltz
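Below is a minimal sketch of seq_builder() in use, reusing the toy lyric corpus from the rancor_builder() example above; the object names (my_corpus, seqs, seq_list) are illustrative only.
# create a small corpus (same toy lyrics as in the examples above)
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(4))
)
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))

# build a matrix of integer sequences (one row per document, zero-padded)
seqs <- seq_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)
dim(seqs)

# the token-to-integer dictionary is stored as an attribute
attr(seqs, "dic")

# return a list of (unpadded) integer sequences instead of a matrix
seq_list <- seq_builder(my_corpus, text = clean_text, doc_id = line_id, matrix = FALSE)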
A dataset containing eight English stoplists. It is used with the get_stoplist() function; a brief usage sketch follows the variable list below.
stoplists
A data frame with 1775 rows and 2 variables.
The stoplists include:
"tiny2020": Tiny (2020) list of 33 words (Default)
"snowball2001": Snowball (2001) list of 127 words
"snowball2014": Updated Snowball (2014) list of 175 words
"van1979": van Rijsbergen's (1979) list of 250 words
"fox1990": Christopher Fox's (1990) list of 421 words
"smart1993": Original SMART (1993) list of 570 words
"onix2000": ONIX (2000) list of 196 words
"nltk2001": Python's NLTK (2009) list of 179 words
Tiny 2020 is a very small stoplist of the most frequent English conjunctions, articles, prepositions, and demonstratives (N = 17). It also includes the 8 forms of the copular verb "to be" and the 8 most frequent personal (singular and plural) pronouns (minus gendered and possessive pronouns).
No contractions are included.
Variables:
words. words to be stopped
source. source of the list
This function evaluates how well an anchor set defines a semantic direction. Anchors must be a two-column data.frame or a list of length == 2. Currently, the function only implements the "PairDir" metric developed by Boutyline and Johnston (2023).
test_anchors(anchors, wv, method = c("pairdir"), all = FALSE, summarize = TRUE)
anchors |
A data frame or list of juxtaposed 'anchor' terms |
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as terms. |
method |
Which metric to use for evaluation (currently only "pairdir") |
all |
Logical (default = FALSE). If TRUE, all pairwise combinations of terms between the two anchor sets are evaluated, rather than only the specified pairs. |
summarize |
Logical (default |
According to Boutyline and Johnston (2023):
"We find that PairDir – a measure of parallelism between the offset vectors (and thus of the internal reliability of the estimated relation) – consistently outperforms other reliability metrics in explaining axis accuracy."
Boutyline and Johnston only consider analyst-specified pairs. However, if all = TRUE, all pairwise combinations of terms between the two anchor sets are evaluated. This allows for unequal sets of anchors; however, it increases computational complexity considerably.
dataframe or list
Boutyline, Andrei, and Ethan Johnston. 2023. “Forging Better Axes: Evaluating and Improving the Measurement of Semantic Dimensions in Word Embeddings.” doi:10.31235/osf.io/576h3
# load example word embeddings
data(ft_wv_sample)

df_anchors <- data.frame(
  a = c("rest", "rested", "stay", "stand"),
  z = c("coming", "embarked", "fast", "move")
)

test_anchors(df_anchors, ft_wv_sample)

test_anchors(df_anchors, ft_wv_sample, all = TRUE)
Provides a small dictionary which matches common English pronouns and nouns to conventional gender categories ("masculine" or "feminine"). There are 20 words in each category.
tiny_gender_tagger()
returns a tibble with two columns
Dustin Stoltz
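A quick sketch of the tagger in use; the returned tibble's column names are not assumed here, so columns are referenced by position, and the DTM subsetting step is only illustrative.
# inspect the small gender dictionary (20 terms per category)
tagger <- tiny_gender_tagger()
head(tagger)

# the first column (terms) could be matched against a DTM's columns, e.g.:
# gendered_dtm <- dtm[, colnames(dtm) %in% tagger[[1]], drop = FALSE]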
A streamlined function to take raw texts from a column of a data.frame and produce a list of all the unique tokens. It tokenizes on a fixed, single whitespace and then extracts the unique tokens. The result can be used as input to dtm_builder() to standardize the vocabulary (i.e., the columns) across multiple DTMs. Prior to building the vocabulary, texts should have whitespace trimmed and, if desired, punctuation removed and terms lowercased.
vocab_builder(data, text)
data |
Data.frame with one column of texts |
text |
Name of the column with documents' text |
returns a list of unique terms in a corpus
Dustin Stoltz
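A minimal sketch of vocab_builder() feeding a shared vocabulary to dtm_builder(); the vocab argument of dtm_builder() is assumed from the description above, and my_corpus is the toy corpus used in the earlier examples.
# build a vocabulary from one corpus
vcb <- vocab_builder(data = my_corpus, text = clean_text)

# use it to standardize the columns across DTMs built from different corpora
# (the 'vocab' argument of dtm_builder() is assumed here)
dtm_a <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id,
  vocab = vcb
)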