Since the release of ChatGPT in 2022, researchers have been increasingly interested in analyzing text data with this and other tools from a new wave of generative artificial intelligence. Generative AI[1] builds on decades of development in natural language processing, the field that seeks to model patterns in text so that machines can interpret, respond to, and produce human language, and it represents a tremendous leap in the ability of computers to "understand" and produce text. While this technology has great potential, it introduces concerns about data privacy, ethics, and the accuracy of such tools.
AI can be used for a wide range of qualitative research applications; researchers use AI in this way because it offers considerable efficiency and quality control, particularly when dealing with large volumes of data. For example, automated techniques can identify common words, phrases, or topics in qualitative data and classify the data based on these topics. Like a human analyst, AI methods can code qualitative data inductively, deductively, or using a blended approach. They can also do so consistently across a large dataset, without any concern about codes being applied differently over time or by different analysts, and in a fraction of the time a human coding team would need.
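As a small illustration of the most basic version of this kind of automation, the sketch below uses scikit-learn to surface the most frequent terms across a set of excerpts. The excerpts and the choice of library are illustrative assumptions, not a prescribed workflow.

```python
# A minimal sketch: surface the most frequent terms across a set of
# interview excerpts using scikit-learn. The excerpts are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

excerpts = [
    "The program helped my family find stable housing last year.",
    "Housing costs keep rising and my family cannot keep up.",
    "I lost my job in March and have not found steady work since.",
]

# Count single words and two-word phrases, dropping common English stopwords.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
counts = vectorizer.fit_transform(excerpts)

# Sum each term's count across all excerpts and print the most frequent.
totals = counts.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
for term, total in sorted(zip(terms, totals), key=lambda pair: -pair[1])[:5]:
    print(f"{term}: {int(total)}")
```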
Generative AI tools introduce the ability to harness information and insights from existing models and apply them to new datasets, which makes these tools substantially more powerful than previous generations of AI. These tools function similarly to older AI tools in that they “learn” through multiple encounters with the same words, phrases, or topics. However, where older tools generally perform best with relatively large quantities of specific input data, generative AI tools have already learned patterns from billions of words scraped from the internet and can be applied without additional training data. As such, generative AI tools have the potential to make important contributions to qualitative data analysis.
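To make the idea of applying an already-trained model to new data concrete, the sketch below uses a zero-shot classification pipeline from the Hugging Face transformers library. The excerpt and candidate codes are illustrative, and the model shown is a pretrained classifier rather than a full generative chat model; it simply demonstrates coding text with no additional training data.

```python
# A minimal sketch of coding an excerpt with a pretrained model and no
# additional training, via the transformers zero-shot pipeline.
from transformers import pipeline

# bart-large-mnli is a commonly used zero-shot classification model.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

excerpt = "The program helped my family find stable housing last year."
codes = ["housing", "employment", "health care", "child care"]

result = classifier(excerpt, candidate_labels=codes)
print(result["labels"][0], round(result["scores"][0], 3))  # top code and score
```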
Despite its promise, AI (especially generative AI) can present risks to security and privacy. When it comes to data security, the most common publicly available generative AI tools (ChatGPT, Gemini, Claude) do not meet standard data privacy requirements for personally identifiable information (PII) because these tools send data over the internet (similar to how one would not email data containing PII).[2] In addition, funders may have requirements about the privacy and security of AI systems used in their projects (for instance, the Department of Health and Human Services has specific guidelines).
Beyond this, there are deeper concerns about AI’s adherence to the Belmont Report’s principle of respect for persons, which requires that participants should know how their data will be used. Qualitative researchers must ensure that participants understand whether their data will be ingested into internet-connected AI tools, and further research will be required to determine if such use is widely acceptable. Americans broadly have reservations about the impact of AI on people’s privacy, so it is uncertain whether survey respondents or interview participants will feel comfortable with their answers being processed using such tools.
As it stands, researchers should not upload nonpublic qualitative data to internet-based AI tools and should instead consider one of the more secure options described below.
Several appealing options, including generative AI, can facilitate the secure and private analysis of data. These include firewalled instances of popular large language models (LLMs), smaller open-source models run on one's own servers, and non-LLM techniques that often have lower technical requirements.
The most powerful method for applying AI to data that must be kept private and secure is through a custom, firewalled instance of an LLM that a company or institution pays to maintain. For example, a university might contract with Microsoft to run its own instance of ChatGPT, which allows it to securely upload and analyze data while ensuring that these data are not used for model training in any way. This arrangement grants access to the full power of such models while preserving privacy and security; however, it is generally extremely expensive.
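From the analyst's side, working with such a deployment looks much like using the public tools. The hypothetical sketch below calls a private Azure OpenAI deployment through the openai Python client; the endpoint, API key, deployment name, and codebook prompt are all placeholders.

```python
# A hypothetical sketch of coding an excerpt through an institution's
# private Azure OpenAI deployment. Endpoint, key, deployment name, and
# prompt are placeholders, not real values.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-institution.openai.azure.com",  # placeholder
    api_key="YOUR_API_KEY",  # placeholder
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="your-deployment-name",  # the institution's model deployment
    messages=[
        {"role": "system",
         "content": ("Apply exactly one code from this codebook to the "
                     "excerpt: housing, employment, health care.")},
        {"role": "user",
         "content": "The program helped my family find stable housing last year."},
    ],
)
print(response.choices[0].message.content)
```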
A more affordable alternative is for researchers to download and run open source[3] tools (for instance, Meta's Llama 3 model) on their own servers. These provide complete data privacy, since, as with any other data analysis software, they can run on a secure server with no interaction with the internet, and they are available for free. Smaller versions of these models may fall short of current cutting-edge models in performance, while larger versions perform well but require substantial computing resources. Given the performance limitations of some models, their use requires careful benchmarking against human-coded results to determine whether performance is acceptable.
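The sketch below shows one way such a benchmark might look, assuming a Llama 3 model is already being served locally by Ollama (https://ollama.com): it asks the model to code a handful of excerpts and then compares the model's codes with human-assigned codes using Cohen's kappa. The excerpts, codebook, and prompt wording are illustrative.

```python
# A minimal sketch, assuming a local Llama 3 model served by Ollama, of
# coding excerpts and benchmarking the results against human coders.
# All example data and the prompt wording are illustrative.
import ollama
from sklearn.metrics import cohen_kappa_score

excerpts = [
    "The program helped my family find stable housing last year.",
    "I lost my job in March and have not found steady work since.",
    "Getting a doctor's appointment takes months in our county.",
]
human_codes = ["housing", "employment", "health care"]

prompt = ("Respond with exactly one code for this excerpt, chosen from: "
          "housing, employment, health care.\n\nExcerpt: ")

# Ask the local model to code each excerpt.
model_codes = []
for excerpt in excerpts:
    reply = ollama.chat(model="llama3",
                        messages=[{"role": "user", "content": prompt + excerpt}])
    model_codes.append(reply["message"]["content"].strip().lower())

# Agreement between the model's codes and the human-assigned codes.
print("Cohen's kappa:", cohen_kappa_score(human_codes, model_codes))
```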
Finally, if neither of these approaches is feasible, researchers can explore non-LLM natural language processing approaches, including topic modeling, word embeddings, and supervised learning, that can easily run within a secure data environment. These approaches are based on more traditional statistical and natural language processing techniques and require fewer computing resources and less code to run.
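For example, the short sketch below fits a topic model using scikit-learn's implementation of latent Dirichlet allocation, entirely offline; the excerpts and the number of topics are illustrative.

```python
# A minimal sketch of topic modeling with scikit-learn's latent Dirichlet
# allocation, which runs entirely offline inside a secure environment.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

excerpts = [
    "Housing costs keep rising and my family cannot keep up.",
    "We finally found an affordable apartment near my job.",
    "The clinic helped me schedule checkups for both children.",
    "Getting a doctor's appointment takes months in our county.",
]

# Build a document-term matrix, then fit two topics; in practice the
# number of topics is tuned by the analyst.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(excerpts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

# Print the top words for each topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-4:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```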
AI’s capabilities have exploded in the past few years and are expected to continue growing in new and unpredictable ways. This unprecedented growth means we cannot forecast with great accuracy where this technology will lead in the next two years. Nonetheless, we can expect many developments to impact AI’s potential to shape qualitative research. These advances are likely to include improvements in non-internet-connected models that facilitate privacy-preserving data analysis; expansion of tools that process and interact with video, image, and other forms of data; increasing use and acceptance of AI by the broader public; and deepening integration of AI into personal technology (computers, cell phones, glasses, cars) that may impact data collection.
[1] Generative AI is part of the broader field of AI, which covers, essentially, technologies that seek to allow computers to perform complex tasks that would require intelligence if done by a human.
[2] With most models, users can choose settings to ensure that uploaded data will not be used for model training; however, this is still not considered perfectly secure.
[3] Open source software is software whose source code is freely available; users can download, run, and modify it at will.