ePrints (Open Access)

Does ChatGPT have sociolinguistic competence?

NU author(s): Dr Daniel Duncan

Licence

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND).


Abstract

Large language models are now able to generate content- and genre-appropriate prose with grammatical sentences. However, these targets do not fully encapsulate human-like language use. For example, they set aside the fact that human language use involves sociolinguistic variation that is regularly constrained by internal and external factors. This paper tests whether one widely used LLM application, ChatGPT, is capable of generating such variation. I construct an English corpus of “sociolinguistic interviews” using the application and analyze the generation of seven morphosyntactic features. I show that the application largely fails to generate any variation at all when one variant is prescriptively incorrect, but that it is able to generate internally constrained variable deletion of the complementizer “that”, with variants occurring at human-like rates. ChatGPT fails, however, to properly generate externally constrained deletion of complementizer “that”. I argue that these outcomes reflect bias in both the training data and Reinforcement Learning from Human Feedback. I suggest that testing whether an LLM can properly generate sociolinguistic variation is a useful metric for evaluating whether it generates human-like language.


Publication metadata

Author(s): Duncan D

Publication type: Article

Publication status: Published

Journal: Journal of Computer-Assisted Linguistic Research

Year: 2024

Volume: 8

Pages: 51-75

Online publication date: 15/11/2024

Acceptance date: 24/09/2024

Date deposited: 25/09/2024

ISSN (electronic): 2530-9455

Publisher: Universitat Politècnica de València

URL: https://doi.org/10.4995/jclr.2024.21958

DOI: 10.4995/jclr.2024.21958

Data Access Statement: The ChatGPT-generated transcripts, Python code used in processing transcripts, raw data containing sociolinguistic variable tokens, and R code used to analyze “family” size and deletion of complementizer “that” may be found in an OSF repository with a CC-By Attribution 4.0 International license (Duncan 2024).

