Title: | Large Language Model (LLM) Tools for Psychological Text Analysis |
Version: | 1.0.0 |
Maintainer: | Lindley Slipetz <ddj6tu@virginia.edu> |
Description: | A collection of large language model (LLM) text analysis methods designed with psychological data in mind. Currently, LLMing (aka "lemming") includes a text anomaly detection method based on the angle-based subspace approach described by Zhang, Lin, and Karim (2015) <doi:10.1016/j.ress.2015.05.025>. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.3 |
Imports: | Rdpack, quanteda, stopwords, stringi, reticulate, text, dbscan, pracma, stats |
RdMacros: | Rdpack |
URL: | https://github.com/sliplr19/LLMing |
BugReports: | https://github.com/sliplr19/LLMing/issues |
NeedsCompilation: | no |
Packaged: | 2025-10-16 16:16:03 UTC; ddj6tu |
Author: | Lindley Slipetz [aut, cre], Teague Henry [aut], Siqi Sun [ctb] |
Depends: | R (≥ 4.1.0) |
Repository: | CRAN |
Date/Publication: | 2025-10-21 17:50:06 UTC |
LLMing: Text Analysis Tools for Psychological Data
Description
Package-level documentation and references.
Author(s)
Maintainer: Lindley Slipetz ddj6tu@virginia.edu
Authors:
Teague Henry ycp6wm@virginia.edu
Other contributors:
Siqi Sun mgd6vc@virginia.edu [contributor]
See Also
Useful links:
https://github.com/sliplr19/LLMing
Report bugs at https://github.com/sliplr19/LLMing/issues
Thresholding of pCOS dataframe
Description
Converts each column of a pCOS score matrix into binary (0/1) indicators of whether each cell meets the threshold theta.
Usage
G_thres(pCOS_mat, theta)
Arguments
pCOS_mat |
Dataframe of pCOS values |
theta |
Numeric threshold |
Value
A matrix of 0s and 1s indicating which cells meet the threshold
Examples
z_dat <- data.frame("A" = rnorm(500,0,1), "B" = rnorm(500,0,1), "C" = rnorm(500,0,1))
snn <- sim_SNN(z_dat, 10, 5)
vec_snn <- vector_SNN(z_dat, snn)
pCOSdat <- pCOS(z_dat, vec_snn)
G <- G_thres(pCOSdat, theta = 0.1)
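The thresholding idea can be illustrated in base R. This is only a sketch, not the package's implementation: `threshold_sketch` and its `cutoff` argument are hypothetical names, and both the comparison direction (`>`) and the use of a fixed cutoff are assumptions, since this page does not state how G_thres derives its cutoff from theta.

```r
# Hypothetical helper: elementwise thresholding of a score matrix.
# Assumption: a cell "meets the threshold" when it is strictly greater
# than a fixed cutoff. G_thres may define this differently.
threshold_sketch <- function(scores, cutoff) {
  m <- as.matrix(scores)
  (m > cutoff) * 1L  # logical matrix converted to 0/1 integers
}

scores <- matrix(c(0.05, 0.2, 0.1, 0.5), nrow = 2)
threshold_sketch(scores, cutoff = 0.1)
```

The result keeps the dimensions of the input, so each cell of the indicator matrix lines up with the score it was derived from.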
Embed texts with a Transformer model
Description
Cleans a text column and converts it to a dataframe of numeric vectors via BERT embeddings. Each row of the input dataframe is one text entry.
Usage
embed(dat, layers, keep_tokens = TRUE, tokens_method = NULL)
Arguments
dat |
A dataframe with text data, one text per row |
layers |
Integer vector specifying which model layers to aggregate from. |
keep_tokens |
Logical. If TRUE, keep token-level embeddings in the returned object; if FALSE, discard them to save memory |
tokens_method |
Character scalar controlling how token-level embeddings are aggregated to word types |
Value
A dataframe where each row corresponds to one input text and each column is an embedding dimension
Examples
df <- data.frame(
  text = c(
    "I slept well and feel great today!",
    "I saw from friends and it went well.",
    "I think I failed that exam. I'm such a disappointment."
  )
)
emb_dat <- embed(dat = df, layers = 1, keep_tokens = FALSE, tokens_method = "mean")
Local outlier score
Description
Computes a normalized Mahalanobis distance score. Only features with nonzero scores in S receive nonzero Mahalanobis scores.
Usage
normahalo(z, rs, S)
Arguments
z |
Dataframe of z scores |
rs |
List of reference sets |
S |
Dataframe of numeric values |
Value
A dataframe of local outlier scores
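One way a normalized, subspace-restricted Mahalanobis-style score could be computed for a single observation is sketched below. This is an illustration under stated assumptions, not the code behind normahalo(): the function name `los_sketch` is hypothetical, the covariance is assumed diagonal (each retained feature is standardized by the reference set's own mean and standard deviation), and the normalization divides by the number of retained features.

```r
# Hypothetical helper: score one observation against its reference set.
# Assumptions: diagonal covariance, normalization by the number of
# retained features, and no zero-variance columns in the reference set.
los_sketch <- function(z_row, ref, s_row) {
  ref <- as.matrix(ref)
  keep <- which(s_row != 0)            # features flagged as relevant
  if (length(keep) == 0) return(0)     # nothing retained, score is zero
  sub <- ref[, keep, drop = FALSE]
  mu <- colMeans(sub)
  sigma <- apply(sub, 2, sd)
  std_dev <- (z_row[keep] - mu) / sigma
  sqrt(sum(std_dev^2)) / length(keep)
}

ref <- cbind(c(0, 2), c(5, 7))          # two reference points, two features
los_sketch(c(1 + sqrt(2), 100), ref, s_row = c(1, 0))
```

Because the second feature is masked out by `s_row`, its extreme value of 100 does not contribute to the score.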
pCOS scores for every row of dataframe
Description
Applies pCOS_row() to corresponding rows of two data frames, returning one row of pCOS values per input row.
Usage
pCOS(z_dat, vec_SNN)
Arguments
z_dat |
Numeric dataframe, typically z-scores |
vec_SNN |
Numeric dataframe, typically the output of vector_SNN |
Value
A dataframe with the same dimensions as z_dat
Pairwise cosine-style row score
Description
Given two numeric vectors, computes an average cosine-based similarity.
Usage
pCOS_row(z, v_SNN)
Arguments
z |
Numeric vector |
v_SNN |
Numeric vector, same size as z |
Value
A numeric vector
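One possible reading of "average cosine-based similarity" is sketched below in base R. It is an assumption loosely modeled on the angle-based idea in the cited approach, not the package's code: `pcos_row_sketch` is a hypothetical name, the difference vector is taken as `z - v`, and for each feature the absolute cosine between that difference and the feature's axis is averaged over all two-dimensional planes formed with every other feature.

```r
# Hypothetical helper: per-feature average absolute cosine.
# For feature j and each other feature j', the cosine in the (j, j') plane
# is |l_j| / sqrt(l_j^2 + l_j'^2), where l = z - v.
pcos_row_sketch <- function(z, v) {
  stopifnot(length(z) == length(v), length(z) >= 2)
  l <- abs(z - v)
  vapply(seq_along(l), function(j) {
    denom <- sqrt(l[j]^2 + l[-j]^2)
    # A zero-length projection has no defined angle; treat it as 0.
    cosines <- ifelse(denom > 0, l[j] / denom, 0)
    mean(cosines)
  }, numeric(1))
}

pcos_row_sketch(c(3, 4, 0), c(0, 0, 0))
# returns 0.8 0.9 0.0
```

Features along which the observation departs most from the reference vector get values close to 1, and features with no departure get 0.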
The vectors of the shared nearest neighbors
Description
Creates a list of the vectors of the top shared nearest neighbors for each row of the z dataframe.
Usage
rep_set(z, snn)
Arguments
z |
Dataframe of values of reference set |
snn |
Dataframe of shared nearest neighbors indices |
Value
A list of dataframes where each row of the dataframe is the vector representation of a given shared nearest neighbor
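The list structure described above can be sketched in base R as follows. The helper name `neighbor_rows_sketch` is hypothetical, and the sketch assumes that row i of `snn` holds the row indices of observation i's top shared nearest neighbors.

```r
# Hypothetical helper: collect the neighbor rows for every observation.
# Assumption: snn[i, ] contains row indices into z.
neighbor_rows_sketch <- function(z, snn) {
  z <- as.data.frame(z)
  snn <- as.matrix(snn)
  lapply(seq_len(nrow(snn)), function(i) z[snn[i, ], , drop = FALSE])
}

z <- data.frame(a = 1:3, b = 4:6)
snn <- rbind(c(2, 3), c(1, 3), c(1, 2))
ref_sets <- neighbor_rows_sketch(z, snn)
ref_sets[[1]]  # rows 2 and 3 of z
```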
Compute shared nearest neighbors
Description
Builds a shared nearest neighbors (SNN) matrix and, for each row (observation), returns the indices of the top neighbors with the largest SNN overlap counts.
Usage
sim_SNN(z_dat, k, tops)
Arguments
z_dat |
A dataframe with numeric columns |
k |
An integer giving the number of nearest neighbors |
tops |
An integer giving how many of the top shared nearest neighbors to return |
Value
A dataframe of top rows with shared nearest neighbors
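The shared nearest neighbors idea can be sketched with base R alone. This is not the package's implementation: `snn_sketch` is a hypothetical name, Euclidean distance is an assumption, and ties are broken by row order.

```r
# Hypothetical helper: top shared-nearest-neighbor indices per row.
snn_sketch <- function(x, k, tops) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k < n, tops < n)
  d <- as.matrix(dist(x))  # Euclidean distances (assumption)
  diag(d) <- Inf           # a point is not its own neighbor
  # n x k matrix of k-nearest-neighbor indices
  knn <- do.call(rbind, lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)]))
  # membership[i, j] = 1 when j is among the k nearest neighbors of i
  membership <- matrix(0, n, n)
  membership[cbind(rep(seq_len(n), k), as.vector(knn))] <- 1
  shared <- tcrossprod(membership)  # shared[i, j] = neighbors i and j have in common
  diag(shared) <- -1                # exclude self from the ranking
  do.call(rbind, lapply(seq_len(n), function(i) {
    order(shared[i, ], decreasing = TRUE)[seq_len(tops)]
  }))
}

x <- data.frame(v = c(0, 0.1, 0.2, 10, 10.1, 10.2))
snn_sketch(x, k = 2, tops = 2)
```

In this toy input the two clusters are far apart, so every row's top shared neighbors come from its own cluster.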
Text anomaly score
Description
Text anomaly detection method adapted from Zhang et al. (2015).
Usage
textanomaly(dat, k, tops, theta)
Arguments
dat |
A dataframe with text data, one text per row |
k |
An integer giving the number of nearest neighbors |
tops |
An integer giving how many of the top shared nearest neighbors to return |
theta |
Numeric threshold |
Value
A dataframe of local outlier scores
References
Zhang L, Lin J, Karim R (2015). “An angle-based subspace anomaly detection approach to high-dimensional data: With an application to industrial fault detection.” Reliability Engineering & System Safety, 142, 482–497. ISSN 0951-8320, doi:10.1016/j.ress.2015.05.025.
Aggregate dataframe into mean feature vectors
Description
For each row of the SNN index matrix, this function takes the corresponding rows of the reference dataframe z and computes their column means, yielding one mean vector per observation.
Usage
vector_SNN(z, snn)
Arguments
z |
Numeric dataframe |
snn |
Dataframe of shared nearest neighbors indices |
Value
A dataframe with the same dimensions as z
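The aggregation described above can be sketched in base R. The helper name `neighbor_means_sketch` is hypothetical, and the sketch assumes that row i of `snn` holds row indices into `z`.

```r
# Hypothetical helper: one mean feature vector per observation.
# Assumption: snn[i, ] contains the row indices of observation i's neighbors.
neighbor_means_sketch <- function(z, snn) {
  z <- as.matrix(z)
  snn <- as.matrix(snn)
  means <- lapply(seq_len(nrow(snn)), function(i) {
    colMeans(z[snn[i, ], , drop = FALSE])
  })
  as.data.frame(do.call(rbind, means))
}

z <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
snn <- rbind(c(2, 3), c(1, 3), c(1, 2))
neighbor_means_sketch(z, snn)
```

The first output row is the mean of rows 2 and 3 of `z`, so the result has one row per observation and one column per feature.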
Z-score on columns
Description
Computes z-scores for each column of a dataframe.
Usage
z_score(dat)
Arguments
dat |
A dataframe with numeric cells |
Value
A dataframe with numeric cells with the same dimensions as dat
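Column-wise standardization can be sketched with base R's scale(). The helper name `z_score_sketch` is hypothetical, and the use of the sample standard deviation (the scale() default) is an assumption about how z_score() standardizes.

```r
# Hypothetical helper: center each column and divide by its sample
# standard deviation, returning a dataframe of the same dimensions.
z_score_sketch <- function(dat) {
  as.data.frame(scale(as.matrix(dat)))
}

dat <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))
z_score_sketch(dat)
```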