Can reduced-keyboard disambiguating text entry handle spelling errors?

David MacKay and Dale Grover

Consider a disambiguating text-entry system based on the standard phone keypad, whose eight letter keys (2-9) each stand for three or four letters. (Examples are T9 and iTap on mobile phones; TAPIR is a free software implementation for PCs.) Such systems let one write English at roughly one key-press per character, but they work well only if the user enters words that are in the system's dictionary.

It would be nice if someone who makes spelling errors, whether by miskeying or by not knowing the correct spelling, could nevertheless use a disambiguating keyboard.

The accompanying figure evaluates, crudely, the prospects for disambiguation when miskeyings are permitted. We assume that the user enters the correct number of letters, and ask what happens to the density of words in 8-key space if the user may substitute either one or two characters in a word. Writing $C_l$ for the number of dictionary words of length $l$, we approximate the number of words of length $l$ in the expanded dictionary in these two cases by $8 l C_l$ and $8^2 l(l-1) C_l / 2$. (These are slight overcounts, since for example they include 'substitutions' that re-enter the original key.) We plot the density of words, defined to be the number of words divided by $8^l$, the number of distinct key sequences of length $l$.
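
To make these formulas concrete, here is a minimal Python sketch (ours, not part of the original analysis; the counts dictionary is transcribed from the table in the Appendix) that evaluates the three densities:

   # Densities of words in 8-key space, for the normal dictionary and for
   # dictionaries expanded by one or two substituted characters.
   # counts[l] = C_l, transcribed from the Appendix (lengths 4-9 shown).
   counts = {4: 3003, 5: 5942, 6: 9133, 7: 11837, 8: 11905, 9: 10347}

   def density_normal(l, C):
       """C_l / 8^l : density of the unmodified dictionary."""
       return C / 8.0**l

   def density_one_error(l, C):
       """8 l C_l / 8^l : one substituted character (slight overcount)."""
       return 8 * l * C / 8.0**l

   def density_two_errors(l, C):
       """8^2 l(l-1)/2 C_l / 8^l : two substituted characters."""
       return 64 * l * (l - 1) / 2 * C / 8.0**l

   print("l   normal   1 error  2 errors")
   for l, C in sorted(counts.items()):
       print(f"{l:<3d} {density_normal(l, C):8.3f} "
             f"{density_one_error(l, C):8.3f} {density_two_errors(l, C):9.3f}")

Running it reproduces the three curves in the accompanying figure.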

For reference, we also plot the density of words in the normal dictionary, which drops below 1 at length 4. Users of T9 will probably agree that a density smaller than 1 means that disambiguation selections (to separate clashes such as GOOD/HOME, KISS/LIPS, or PINT/RIOT) are rarely needed.
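
For readers without a phone keypad to hand, the following small sketch (our illustration, using the standard letter-to-key assignment) confirms that each of these pairs maps to the same key sequence:

   # Standard phone keypad: keys 2-9 each carry three or four letters.
   KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
             '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
   LETTER_TO_KEY = {c: k for k, letters in KEYPAD.items() for c in letters}

   def key_sequence(word):
       """Digit string the user presses for this word on keys 2-9."""
       return ''.join(LETTER_TO_KEY[c] for c in word.lower())

   for a, b in [('good', 'home'), ('kiss', 'lips'), ('pint', 'riot')]:
       print(a, b, key_sequence(a), key_sequence(a) == key_sequence(b))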

If a single keying error is permitted per word, the density of words becomes significantly greater than 1 for 4- and 5-letter words, and is slightly greater than 1 for 6-letter words.
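
As a check against the word counts in the Appendix, the one-error density at length 6 is $8 \times 6 \times 9133 / 8^6 = 438384 / 262144 \approx 1.67$.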

If two keying errors are permitted per word, the density of words becomes significantly greater than 1 for 6- and 7-letter words, and is slightly greater than 1 for 8-letter words.
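
Likewise, the two-error density at length 8 is $8^2 \times (8 \times 7 / 2) \times 11905 / 8^8 = 21333760 / 16777216 \approx 1.27$.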

[Figure: density of words versus word length, for the normal dictionary, for one substitution error (orange curve), and for two substitution errors.]

DJCM's take on this situation: allowing substitution errors might just fly. We can't hope to do error correction for 3- or 4-letter words, but I reckon spelling errors are much more likely in longer words anyway, because it is harder to keep track of where you are up to in a long word like 'beginning'. For 5-letter words the density of the original dictionary is significantly below 1, so most miskeyings will take us to strings that are not valid words; and since the one-error density at length 5 is about 7 (orange curve), the list of candidates would be about 7 long, and 2 or 3 disambiguation selections might typically be needed to find the desired word. For words of length 6, the one-error density is about 2, which means most miskeyings will require just one disambiguation selection. So I think Dale is right that some spelling correction is possible, at least for words of length greater than 5. Errors in words of length 5 will be a little clumsy to correct, length 6 should work well, and errors in words of length 7 or more should be corrected instantly, as long as there is exactly one substitution error.
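
To make the proposed procedure concrete, here is a minimal sketch of one way a decoder could handle a single substituted key. This is our own illustration, not the algorithm used by T9, iTap, or TAPIR; the toy dictionary and the helper names (build_index, candidates) are made up for the example.

   from collections import defaultdict

   # Same letter-to-key mapping as in the earlier sketch.
   KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
             '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
   LETTER_TO_KEY = {c: k for k, letters in KEYPAD.items() for c in letters}

   def key_sequence(word):
       return ''.join(LETTER_TO_KEY[c] for c in word.lower())

   def build_index(words):
       """Map each key sequence to the dictionary words that produce it."""
       index = defaultdict(list)
       for w in words:
           index[key_sequence(w)].append(w)
       return index

   def candidates(typed, index):
       """Exact matches first, then words one key-substitution away."""
       exact = list(index.get(typed, []))
       near = []
       for i in range(len(typed)):
           for k in KEYPAD:                  # the 7 other keys at position i
               if k != typed[i]:
                   near.extend(index.get(typed[:i] + k + typed[i + 1:], []))
       return exact + [w for w in near if w not in exact]

   # Toy dictionary; a real system would also rank candidates by frequency.
   index = build_index(['kiss', 'lips', 'kept', 'list'])
   print(candidates('5478', index))   # meant 'lips' (5477), hit 8 for the last 7

With this toy dictionary, candidates('5478', index) returns ['list', 'kept', 'kiss', 'lips']: the exact match first, then the words one substitution away, which is exactly the kind of short candidate list discussed above.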

Further work is required to fold in the possibility of insertion and deletion errors as well.


Appendix

This list of word counts was obtained using the following Unix pipeline:

   perl -p -e "s/.*\'s\s//g ; " /usr/share/dict/words |
     lower.p |
     sort |
     uniq |
     lengthcount.p
   
which removes possessives from the system dictionary, lower-cases the remaining words (lower.p), discards duplicates, and counts the distinct words at each length (lengthcount.p).
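
lower.p and lengthcount.p are local helper scripts that are not listed here; the following stand-alone Python sketch (assuming the same dictionary file) does roughly the same job:

   import re
   from collections import Counter

   # Drop possessives, lower-case, de-duplicate, count words at each length.
   with open('/usr/share/dict/words') as f:
       words = {re.sub(r".*'s$", '', line.strip()).lower() for line in f}
   words.discard('')

   length_counts = Counter(len(w) for w in words)
   for length in sorted(length_counts):
       print(length, length_counts[length])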

Here's the raw data.

# Length  Count
1       26
2       167
3       755
4       3003
5       5942
6       9133
7       11837
8       11905
9       10347
10      8037
11      5378
12      3330
13      1834
14      823
15      380
16      143
17      64
18      24
19      6
20      4
21      2
22      2
 

And here's the gnuplot command

 plot  'LENGTHS' u 1:($2*8.0*8.0*$1*($1-1)/2.0/8.0**$1) \
 t '2 substitution errors' w linesp 3 2,\
  'LENGTHS' u 1:($2*8.0*$1/8.0**$1) \
  t '1 substitution error' w linesp 2 3,\
  1 t '' w l ls 8,\
  'LENGTHS' u 1:($2/8.0**$1) \
  t 'normal dictionary' w linesp 5 4