Title: Large Language Model (LLM) Tools for Psychological Text Analysis
Version: 1.0.0
Maintainer: Lindley Slipetz <ddj6tu@virginia.edu>
Description: A collection of large language model (LLM) text analysis methods designed with psychological data in mind. Currently, LLMing (aka "lemming") includes a text anomaly detection method based on the angle-based subspace approach described by Zhang, Lin, and Karim (2015) <doi:10.1016/j.ress.2015.05.025>.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.3
Imports: Rdpack, quanteda, stopwords, stringi, reticulate, text, dbscan, pracma, stats
RdMacros: Rdpack
URL: https://github.com/sliplr19/LLMing
BugReports: https://github.com/sliplr19/LLMing/issues
NeedsCompilation: no
Packaged: 2025-10-16 16:16:03 UTC; ddj6tu
Author: Lindley Slipetz [aut, cre], Teague Henry [aut], Siqi Sun [ctb]
Depends: R (≥ 4.1.0)
Repository: CRAN
Date/Publication: 2025-10-21 17:50:06 UTC

LLMing: Text Analysis Tools for Psychological Data

Description

Package-level documentation and references.

Author(s)

Maintainer: Lindley Slipetz ddj6tu@virginia.edu

Authors:

Teague Henry

Other contributors:

Siqi Sun [contributor]

See Also

Useful links:

https://github.com/sliplr19/LLMing

Report bugs at https://github.com/sliplr19/LLMing/issues


Thresholding of pCOS dataframe

Description

Converts each column of a pCOS score matrix into binary indicators by comparing each value against the threshold theta.

Usage

G_thres(pCOS_mat, theta)

Arguments

pCOS_mat

Dataframe of pCOS values

theta

Numeric threshold

Value

A matrix of 0s and 1s indicating which cells meet the threshold

Examples

z_dat <- data.frame("A" = rnorm(500,0,1), "B" = rnorm(500,0,1), "C" = rnorm(500,0,1))
snn <- sim_SNN(z_dat, 10, 5)
vec_snn <- vector_SNN(z_dat, snn)
pCOSdat <- pCOS(z_dat, vec_snn)
G <- G_thres(pCOSdat, theta = 0.1)
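
The thresholding step can also be illustrated without the package. The following is a minimal base R sketch, not the package's implementation: the `scores` table and the name `G_sketch` are hypothetical, and it assumes a cell "meets the threshold" when its value is greater than or equal to theta.

```r
# Hypothetical stand-in for a table of pCOS scores.
scores <- data.frame(A = c(0.05, 0.40, 0.10),
                     B = c(0.90, 0.02, 0.30))
theta <- 0.1

# Assumption: a cell meets the threshold when value >= theta.
# The logical comparison is multiplied by 1L to get a 0/1 integer matrix.
G_sketch <- (as.matrix(scores) >= theta) * 1L
print(G_sketch)
```

Whether G_thres() uses a strict or non-strict comparison is not stated here, so check the function source before relying on boundary behavior.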


Embed texts with a Transformer model

Description

Cleans a text column and converts it to a dataframe of numeric vectors via BERT embeddings. For the input dataframe, each row is one text entry.

Usage

embed(dat, layers, keep_tokens = TRUE, tokens_method = NULL)

Arguments

dat

A dataframe with text data, one text per row

layers

Integer vector specifying which model layers to aggregate from.

keep_tokens

Logical; whether to keep token-level embeddings in the returned object (TRUE) or discard them to save memory (FALSE)

tokens_method

Character scalar controlling how token-level embeddings are aggregated to word types

Value

A dataframe where each row corresponds to one input text and each column is an embedding dimension

Examples

df <- data.frame(
  text = c(
    "I slept well and feel great today!",
    "I saw from friends and it went well.",
    "I think I failed that exam. I'm such a disappointment."
  )
)

emb_dat <- embed(dat = df, layers = 1, keep_tokens = FALSE, tokens_method = "mean")


Local outlier score

Description

Computes a normalized Mahalanobis distance score. Only features with nonzero scores in S receive nonzero Mahalanobis scores.

Usage

normahalo(z, rs, S)

Arguments

z

Dataframe of z scores

rs

List of reference sets

S

Dataframe of numeric values

Value

A dataframe of local outlier scores


pCOS scores for every row of dataframe

Description

Applies pCOS_row() to each pair of corresponding rows of two dataframes and stacks the resulting pCOS vectors, one per row.

Usage

pCOS(z_dat, vec_SNN)

Arguments

z_dat

Numeric dataframe, typically z-scores

vec_SNN

Numeric dataframe, typically the output of vector_SNN

Value

A dataframe with same dimensions as z_dat


Pairwise cosine-style row score

Description

Given two numeric vectors, computes an average cosine-based similarity.

Usage

pCOS_row(z, v_SNN)

Arguments

z

Numeric vector

v_SNN

Numeric vector, same size as z

Value

A numeric vector of pCOS values
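
To make the idea of an averaged cosine score concrete, here is a minimal base R sketch of a pairwise cosine in the spirit of Zhang et al. (2015). It is an illustration under stated assumptions, not the code of pCOS_row(): the function name `pairwise_cosine`, the `eps` replacement for exact zeros, and the choice to difference `z` against `v_ref` are all assumptions. For each feature j, it averages |l_j| / sqrt(l_j^2 + l_k^2) over every other feature k, where l is the difference vector.

```r
# Hypothetical helper; the package's pCOS_row() may differ in detail.
pairwise_cosine <- function(z, v_ref, eps = 1e-5) {
  stopifnot(length(z) == length(v_ref), length(z) >= 2)
  l <- z - v_ref        # vector from the reference mean to the point
  l[l == 0] <- eps      # assumption: avoid 0/0 by nudging exact zeros
  vapply(seq_along(l), function(j) {
    # Mean absolute cosine between l and axis j over all 2-D
    # subspaces that pair feature j with another feature.
    mean(abs(l[j]) / sqrt(l[j]^2 + l[-j]^2))
  }, numeric(1))
}

res <- pairwise_cosine(z = c(3, 4), v_ref = c(0, 0))
print(res)
```

Features along which the point deviates most from its reference vector get scores closer to 1, which is what makes the score usable for selecting a subspace.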


The vectors of the shared nearest neighbors

Description

Creates a list of the vectors of the top shared nearest neighbors for each row of the z dataframe

Usage

rep_set(z, snn)

Arguments

z

Dataframe of values of reference set

snn

Dataframe of shared nearest neighbors indices

Value

A list of dataframes where each row of the dataframe is the vector representation of a given shared nearest neighbor
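
The shape of the returned list can be sketched in base R. This is a minimal illustration, not the package's code: the toy `z` and `snn` tables and the name `ref_sets` are hypothetical, and it assumes each row of `snn` holds row indices into `z`.

```r
# Toy data: four observations, two features.
z <- data.frame(A = c(1, 2, 3, 4), B = c(10, 20, 30, 40))
# Assumption: row i of snn lists the row indices of observation i's
# top shared nearest neighbors.
snn <- data.frame(n1 = c(2, 1, 4, 3), n2 = c(3, 3, 1, 1))

# One dataframe per observation, holding its neighbors' feature vectors.
ref_sets <- lapply(seq_len(nrow(snn)), function(i) {
  idx <- as.integer(unlist(snn[i, ]))
  z[idx, , drop = FALSE]
})
print(ref_sets[[1]])
```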


Compute shared nearest neighbors

Description

Builds a shared nearest neighbors matrix and, for each row (observation), returns the indices of the top neighbors with the largest SNN overlap counts

Usage

sim_SNN(z_dat, k, tops)

Arguments

z_dat

A dataframe with numeric columns

k

An integer representing number of nearest neighbors

tops

An integer representing how many shared nearest neighbors to return

Value

A dataframe in which each row holds the indices of the top shared nearest neighbors of the corresponding observation
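
The idea of ranking neighbors by shared nearest neighbor overlap can be sketched in base R. This is a minimal illustration under assumptions, not the package's implementation: it uses Euclidean distance from stats::dist(), excludes each point from its own neighbor list, and breaks ties by index order. All object names are hypothetical.

```r
set.seed(1)
z <- matrix(rnorm(60), nrow = 20)   # 20 observations, 3 features
k <- 5                              # size of each k-nearest-neighbor list
tops <- 3                           # neighbors to keep per observation
n <- nrow(z)

# k nearest neighbors of each row (self excluded via an infinite diagonal).
d <- as.matrix(dist(z))
diag(d) <- Inf
knn <- t(apply(d, 1, function(r) order(r)[seq_len(k)]))

# member[i, j] = 1 when j is among the k nearest neighbors of i.
member <- matrix(0L, n, n)
member[cbind(rep(seq_len(n), times = k), as.vector(knn))] <- 1L

# overlap[i, j] = number of nearest neighbors shared by i and j.
overlap <- tcrossprod(member)
diag(overlap) <- -1                 # never pick an observation as its own neighbor

# Indices of the `tops` observations with the largest overlap, per row.
top_snn <- as.data.frame(
  t(apply(overlap, 1, function(r) order(r, decreasing = TRUE)[seq_len(tops)]))
)
head(top_snn)
```

Larger k makes the overlap counts less sparse, while tops controls how many neighbors feed the later reference-set steps.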


Text anomaly score

Description

Text anomaly detection method adapted from Zhang et al. (2015).

Usage

textanomaly(dat, k, tops, theta)

Arguments

dat

A dataframe with text data, one text per row

k

An integer representing number of nearest neighbors

tops

An integer representing how many shared nearest neighbors to return

theta

Numeric threshold

Value

A dataframe of local outlier scores

References

Zhang L, Lin J, Karim R (2015). “An angle-based subspace anomaly detection approach to high-dimensional data: With an application to industrial fault detection.” Reliability Engineering & System Safety, 142, 482–497. ISSN 0951-8320, doi:10.1016/j.ress.2015.05.025.


Aggregate dataframe into mean feature vectors

Description

For each row of the SNN index matrix, this function takes the corresponding rows of the reference dataframe z and computes their column means, yielding one mean vector per observation.

Usage

vector_SNN(z, snn)

Arguments

z

Numeric dataframe

snn

Dataframe of shared nearest neighbors indices

Value

Dataframe of same dimensions as z
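
The column-mean aggregation described above can be sketched in base R. This is a minimal illustration, not the package's code: the toy `z` and `snn` tables and the name `mean_vecs` are hypothetical, and it assumes each row of `snn` holds row indices into `z`.

```r
z <- data.frame(A = c(1, 2, 3, 4), B = c(10, 20, 30, 40))
# Assumption: row i of snn lists the neighbor row indices for observation i.
snn <- data.frame(n1 = c(2, 1, 4, 3), n2 = c(3, 3, 1, 1))

# For each observation, average its neighbors' rows column by column.
mean_vecs <- as.data.frame(t(vapply(seq_len(nrow(snn)), function(i) {
  idx <- as.integer(unlist(snn[i, ]))
  colMeans(z[idx, , drop = FALSE])
}, numeric(ncol(z)))))
print(mean_vecs)
```

The result has one row per observation and one column per feature, matching the dimensions of z.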


Z-score on columns

Description

Standardizes each column of a dataframe to z-scores.

Usage

z_score(dat)

Arguments

dat

A dataframe with numeric cells

Value

A dataframe with numeric cells with the same dimensions as dat
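
Column-wise standardization can be sketched with base R's scale(). This is a minimal illustration, not necessarily how z_score() is implemented: it assumes centering by the column mean and scaling by the sample standard deviation (n - 1 denominator), and the names `dat` and `z_sketch` are hypothetical.

```r
dat <- data.frame(A = c(1, 2, 3, 4), B = c(10, 30, 50, 70))

# scale() centers each column at its mean and divides by its sample
# standard deviation; as.data.frame() restores the dataframe shape.
z_sketch <- as.data.frame(scale(dat))
print(z_sketch)
```

A constant column has zero standard deviation and would produce NaN values here, so such columns are worth removing before standardizing.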