Automatic identification of confusable drug names☆
Introduction
Many hundreds of drugs have names that either look or sound so much alike that doctors, nurses and pharmacists can get them confused, dispensing the wrong one in errors that can injure or even kill patients. In the United States alone, an estimated 1.3 million people are injured each year from medication errors, such as administering the wrong dose or the wrong drug [1]. For example, a patient needed an injection of Narcan but instead got the drug Norcuron and went into cardiac arrest. The U.S. Food and Drug Administration (FDA) has sought to mitigate this threat by ensuring that proposed drug names that are too similar to pre-existing drug names are not approved [2]. This has motivated the research and design of algorithms underlying phonetic orthographic computer analysis (POCA), an operational system implemented by the Project Performance Corporation for the FDA.
A number of different lexical similarity measures have been applied to the problem of identifying confusable drug names (henceforth referred to as confusion pairs). For example, 22 distinct methods were tested on a set of drug names extracted from published reports of medication errors [3]. The methods included well-known universal measures, such as edit distance, longest common subsequence, and several variations of measures based on counting common letter n-grams, as well as measures designed specifically for associating phonetically similar names, such as Soundex and Editex. The normalized edit distance, Editex, and a trigram-based measure were identified as the most accurate.
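To make two of the universal measures named above concrete, the following is a minimal sketch of a normalized edit distance similarity and a longest common subsequence ratio. The function names are illustrative, and normalizing by the length of the longer name and lowercasing the input are assumptions made here, not details taken from [3].

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ned_similarity(a: str, b: str) -> float:
    """Similarity from edit distance; assumes normalization by the longer name."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio; assumes the same normalization."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest
```

For the Narcan and Norcuron pair from the introduction, both functions return 0.5 under these assumptions: four edits, or four shared letters in order, against a longer name of eight letters.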
We formulate a general framework for representing word similarity measures based on n-grams, and propose a new measure of orthographic similarity called BI-SIM that combines the advantages of several known measures. We show that this new measure performs better on a U.S. pharmacopeial list of confusable drug names than the measures previously identified as the most accurate by [3].
In addition, we present techniques for detecting drug-name confusions that are attributed solely to high phonetic similarity. Consider the example of Xanax versus Zantac, two brand names that the Physicians’ Desk Reference (PDR) warns may be “mistaken for each other lead[ing] to serious medication errors” [4]. The phonetic transcription of the two names, [zænæks] and [zæntæk], reveals a sound-alike similarity that is not apparent in their orthographic form. For the detection of sound-alike confusion pairs, we apply the ALINE phonetic aligner [5], which estimates the similarity between two phonetically-transcribed words. We demonstrate that ALINE outperforms orthographic approaches on a test set containing sound-alike confusion pairs.
We present a novel method of evaluating the accuracy of a measure, which aims at emulating the perspective of a person involved in the process of approving a new drug name. Our approach is to average recall values for each drug name in the test set. The recall is calculated against a published list of confusable drug names considering only the top k potential confusion pairs returned by a similarity measure. The recall values are then aggregated using the technique of macro-averaging [6].
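The evaluation idea can be sketched as follows, assuming a gold-standard mapping from each test name to its set of known confusable names and an arbitrary similarity function. The names `recall_at_k`, `macro_averaged_recall`, `lexicon`, and `gold_sets` are hypothetical, and details such as tie-breaking are not specified in this text.

```python
from typing import Callable, Dict, List, Set

Similarity = Callable[[str, str], float]


def recall_at_k(query: str, lexicon: List[str], gold: Set[str],
                sim: Similarity, k: int) -> float:
    """Fraction of the known confusable names found among the top k candidates."""
    candidates = [name for name in lexicon if name != query]
    # Rank every other name by decreasing similarity to the query name.
    ranked = sorted(candidates, key=lambda name: sim(query, name), reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & gold) / len(gold)


def macro_averaged_recall(lexicon: List[str], gold_sets: Dict[str, Set[str]],
                          sim: Similarity, k: int) -> float:
    """Average the per-name recall values, giving each test name equal weight."""
    recalls = [
        recall_at_k(query, lexicon, gold, sim, k)
        for query, gold in gold_sets.items()
        if gold  # names without known confusions do not contribute
    ]
    return sum(recalls) / len(recalls)
```

Because each name contributes one recall value regardless of how many confusable partners it has, names with a single partner weigh as much as names in larger confusion sets, which is the usual effect of macro-averaging.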
The next section provides the background for the problem we are addressing, several commonly-used measures of word similarity, and our methodology for evaluation. After this, we present two new methods for identifying look-alike and sound-alike drug names. We then compare the effectiveness of various measures using our recall-based evaluation methodology on a U.S. pharmacopeial list and on another test set containing sound-alike confusion pairs. We conclude with a discussion of our experimental results.
Background
The problem of automatic identification of confusable drug names can be stated as follows: given a large set of existing drug names, identify all pairs or sets of drug names that are potentially confusable with each other. An alternative formulation reflects the process of approving a newly proposed drug name: given a proposed drug name and a large set of existing drug names, identify all existing drug names that are potentially confusable with the proposed drug
Orthographic similarity: N-SIM
In this section, we describe the inherent strengths and weaknesses of n-gram and subsequence-based approaches. Next, we present a new, generalized framework, N-SIM, that encompasses a number of commonly used similarity measures. Following this, we describe the parametric settings for BI-SIM, a specific instantiation of this generalized framework which is aimed at combining the advantages of LCSR and BIGRAM.
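The snippet above does not give the parametric settings, so the following is only a hedged sketch of how an n-gram generalization of LCSR might look. It assumes three things that are not stated here: names are padded at the front with copies of their first letter, two n-grams are scored by the fraction of positions at which their letters agree, and the total is normalized by the length of the longer name. The function name `ngram_similarity` is hypothetical.

```python
def ngram_similarity(x: str, y: str, n: int = 2) -> float:
    """LCS-style similarity computed over n-grams instead of single letters."""
    x, y = x.lower(), y.lower()
    if not x or not y:
        return 0.0
    # Assumed padding: n-1 copies of the first letter, so that a word of
    # length m yields exactly m n-grams and initial letters carry more weight.
    px = x[0] * (n - 1) + x
    py = y[0] * (n - 1) + y
    grams_x = [px[i:i + n] for i in range(len(x))]
    grams_y = [py[j:j + n] for j in range(len(y))]

    def gram_score(g: str, h: str) -> float:
        # Assumed graded score: share of positions with identical letters.
        return sum(c == d for c, d in zip(g, h)) / n

    # Dynamic programme analogous to longest common subsequence, but the
    # diagonal step adds a graded score rather than 0 or 1.
    prev = [0.0] * (len(grams_y) + 1)
    for g in grams_x:
        curr = [0.0]
        for j, h in enumerate(grams_y, start=1):
            curr.append(max(prev[j], curr[j - 1], prev[j - 1] + gram_score(g, h)))
        prev = curr
    return prev[-1] / max(len(x), len(y))
```

With `n=1` the sketch reduces to the ordinary longest common subsequence ratio, which illustrates how a single framework can cover existing measures; `n=2` gives a bigram variant in the spirit of the measure described above, without claiming to reproduce its exact settings.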
Phonetic similarity: ALINE
In the preceding section, we proposed a new measure of orthographic similarity for identifying look-alike drug names. However, the detection of sound-alike confusion pairs often requires a different kind of approach. For this purpose, we employ ALINE [5], which computes phonetic similarity between pairs of phonetically-transcribed words. Its underlying principle is the decomposition of phonemes into elementary articulatory phonetic features.
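The principle of comparing phonemes through their articulatory features can be illustrated with a toy sketch. This is not ALINE: the three categorical features, the equal weighting, and the simple alignment below are all assumptions chosen for brevity, and the feature table covers only the phonemes in the transcriptions [zænæks] and [zæntæk] quoted in the introduction.

```python
from typing import Dict, Sequence, Tuple

# Toy feature table (voicing, place, manner); illustrative only.
FEATURES: Dict[str, Tuple[str, str, str]] = {
    "z": ("voiced", "alveolar", "fricative"),
    "s": ("voiceless", "alveolar", "fricative"),
    "t": ("voiceless", "alveolar", "stop"),
    "k": ("voiceless", "velar", "stop"),
    "n": ("voiced", "alveolar", "nasal"),
    "æ": ("voiced", "front", "vowel"),
}


def phoneme_similarity(p: str, q: str) -> float:
    """Share of features on which two phonemes agree (equal weights assumed)."""
    fp, fq = FEATURES[p], FEATURES[q]
    return sum(a == b for a, b in zip(fp, fq)) / len(fp)


def phonetic_similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """Best monotone alignment score of two phoneme sequences, length-normalized."""
    if not x or not y:
        return 0.0
    prev = [0.0] * (len(y) + 1)
    for p in x:
        curr = [0.0]
        for j, q in enumerate(y, start=1):
            curr.append(max(
                prev[j],                                  # skip phoneme p
                curr[j - 1],                              # skip phoneme q
                prev[j - 1] + phoneme_similarity(p, q),   # align p with q
            ))
        prev = curr
    return prev[-1] / max(len(x), len(y))
```

Calling `phonetic_similarity(list("zænæks"), list("zæntæk"))` scores the two transcriptions phoneme by phoneme, so the shared initial [z] counts as a full match even though the spellings begin with different letters. A feature-based score also gives partial credit to near matches such as [z] and [s], which an exact-match measure would treat as entirely different.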
Evaluation methodology
We designed a new method for evaluating the accuracy of a similarity measure. Our aim was to emulate the perspective of a person involved in the process of approving a new drug name. Because of the sheer number of pharmaceutical products already in existence, it is very difficult for anyone to think of all possible drug names that may be confused with the newly proposed name. A computer program can facilitate this task by presenting the human expert with a ranked list of potential confusion
Experiments and results
We conducted two experiments with the goal of evaluating the relative accuracy of several measures of similarity in identifying confusable drug names. The first experiment was performed against a list of similar drug names reported to the USP Medication Errors Reporting Program [41] (henceforth the USP set). The USP set is a list of 363 confusion sets (both look-alike and sound-alike), which contain 582 unique drug names. Most of the confusion sets are pairs of names, but some contain three or
Discussion
The results described in Section 6 clearly indicate that BI-SIM and TRI-SIM, the newly proposed measures of orthographic similarity, outperform several currently used measures on the USP (mixed) test set regardless of the choice of the cutoff parameter k. On the sound-alike test set, EDITEX and ALINE are the most effective. However, a simple combination of several measures achieves even higher accuracy, exceeding 90% with only the 15 top pairs considered. It is worth noting that NED does
Conclusion
We have investigated the problem of identifying confusable drug name pairs. The effectiveness of several word similarity measures was evaluated using a new recall-based evaluation methodology. We have proposed a new measure of orthographic similarity that outperforms several commonly used similarity measures when tested on a publicly available list of confusable drug names. On a test set containing solely sound-alike confusion pairs, the phonetic approaches ALINE and EDITEX achieve the best
Acknowledgments
The first author’s research was supported by the Natural Sciences and Engineering Research Council of Canada, and the second author’s research was supported by the National Science Foundation. In addition, we are indebted to Project Performance Corporation, specifically Erica Kolatch, Rick Shangraw, and Jessica Toye, for their implementation of our techniques in the POCA system.
References (43)
- et al. Priming lexical neighbours of spoken words: effects of competition and inhibition. J Mem Lang (1989)
- et al. Very fast and simple approximate string matching. Inform Process Lett (1999)
- et al. Effect of orthographic and phonological similarity on false recognition of drug names. J Soc Sci Med (2001)
- Approximate string-matching with q-grams and maximal matches. Theoret Comput Sci (1992)
- et al. The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Inform Stor Retrieval (1974)
- et al. Identification of common molecular sequences. J Mol Biol (1981)
- et al. Incidence of adverse drug reactions in hospitalized patients. J Am Med Assoc (1998)
- Strategies to reduce medication errors. US Food Drug Admin Consum Mag (2003)
- et al. Similarity as a risk factor in drug-name confusion errors: The look-alike (orthographic) and sound-alike (phonetic) model. Med Care (1999)
- Physicians’ desk reference for nonprescription drugs and dietary supplements. 24th ed. New York, NY: Thomson PDR;...
- Phonetic alignment and similarity. Comput Human
- The smart system: experiments in automatic document processing
- Word recognition: context effects without priming. Cognition
- Listening for mispronunciations: a measure of what we hear during speech. Percept Psychophys
- The psychology of language: from data to theory
- A stochastic parts program and noun phrase parser for unrestricted text
- Knowledge sources for word-level translation models
- Identifying cognates by phonetic and semantic similarity
- Bitext maps and alignment via pattern recognition. Comp Linguist
- Using cognates to align sentences in bilingual corpora
☆ A preliminary version of this paper appeared in Proceedings of the 20th International Conference on Computational Linguistics, Geneva (2004), pp. 952–958.