This site is a (hopefully) simple user interface to the UCLA Phonological Segment Inventory Database (UPSID).
This database was compiled by Ian Maddieson and Kristin Precoda (cf. Maddieson, 1984)
and contains information on the distribution of 919 different segments in 451 languages.
Henning Reetz took the original data from a ZIP file
and added the HTML interface you see. Only a few typographical changes
have been made.
Some terms used:
Segment frequency:
This is the number of languages that contain a specific segment divided by the number of languages in UPSID, expressed in percent.
For example, a segment that is found in only one language has a frequency of (1 / 451) * 100 = 0.22
(in other words, it exists in only about 0.2% of all languages in UPSID).
The most frequent segment in UPSID is the bilabial nasal /m/, which occurs in
425 languages and hence has a segment frequency of 94.2%. There are 919 different segments in the database,
so the complete list of all frequencies is rather long.
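The definition of segment frequency can be written as a very small Python sketch. The function and constant names are made up for illustration and are not part of UPSID or of this site; only the numbers 451, 1 and 425 come from the text.

```python
UPSID_LANGUAGE_COUNT = 451  # number of languages in UPSID, as stated in the text


def segment_frequency(languages_with_segment, total_languages=UPSID_LANGUAGE_COUNT):
    """Return the segment frequency in percent (hypothetical helper)."""
    if not 0 <= languages_with_segment <= total_languages:
        raise ValueError("language count must lie between 0 and the total")
    return languages_with_segment / total_languages * 100


print(round(segment_frequency(1), 2))    # a segment found in a single language: 0.22
print(round(segment_frequency(425), 1))  # /m/, found in 425 languages: 94.2
```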
The 20 most frequent consonants and the 10 most frequent vowels are:
consonant: | m | k | j | p | w | b | h | g | N | ? | n | s | tS | S | t | f | l | "n | "t | nj |
in languages: | 425 | 403 | 378 | 375 | 332 | 287 | 279 | 253 | 237 | 216 | 202 | 196 | 188 | 187 | 181 | 180 | 174 | 160 | 152 | 141 |
frequency (%): | 94.2 | 89.4 | 83.8 | 83.2 | 73.6 | 63.6 | 61.9 | 56.1 | 52.6 | 47.9 | 44.8 | 43.5 | 41.7 | 41.5 | 40.1 | 39.9 | 38.6 | 35.5 | 33.7 | 31.3 |
vowel: | i | a | u | E | "o | "e | O | o | e | a~ |
in languages: | 393 | 392 | 369 | 186 | 181 | 169 | 162 | 131 | 124 | 83 |
frequency (%): | 87.1 | 86.9 | 81.8 | 41.2 | 40.1 | 37.5 | 35.9 | 29.0 | 27.5 | 18.4 |
At the other end of the scale, there are many segments that occur in only one or a few languages:
Number of segments: | 427 | 117 | 66 | 39 | 27 | 19 | 14 | 14 | 12 | 13 |
that occur in only this many languages: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
% of all segments: | 46.46 | 12.73 | 7.18 | 4.24 | 2.94 | 2.07 | 1.52 | 1.52 | 1.31 | 1.41 |
cumulative %: | 46.46 | 59.19 | 66.38 | 70.62 | 73.56 | 75.63 | 77.15 | 78.67 | 79.98 | 81.39 |
That is, the group of sounds that appear in 10 or fewer of the 451 languages makes up more than 80% of the 919 sounds in the database.
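The percentage and cumulative percentage rows can be recomputed from the segment counts alone. The following Python sketch uses only the counts and the total of 919 segments given above; the variable names are illustrative.

```python
from itertools import accumulate

TOTAL_SEGMENTS = 919  # number of different segments in UPSID

# Number of segments that occur in exactly 1, 2, ..., 10 languages.
segments_by_language_count = [427, 117, 66, 39, 27, 19, 14, 14, 12, 13]

# Share of all segments for each group, in percent.
percent = [n / TOTAL_SEGMENTS * 100 for n in segments_by_language_count]

# Running total of the counts, converted to percent afterwards so that
# rounding errors of the individual rows do not add up.
cumulative = [c / TOTAL_SEGMENTS * 100 for c in accumulate(segments_by_language_count)]

for k, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"{k:2d} languages: {p:5.2f}%  cumulative {c:5.2f}%")
```

The last cumulative value comes from 748 of the 919 segments, which is the "more than 80%" mentioned above.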
Number of segments in a language:
This is simply the number of segments that are in a language according to the UPSID database.
The histogram below shows the distribution of the number of segments across the 451 languages in UPSID.
min | 2.5% | 10% | 25% | median | mean | 75% | 90% | 97.5% | max |
11 | 16 | 20 | 23 | 29 | 30.97 | 36 | 43 | 58 | 141 |
There are actually two languages with 11 and one with 141 segments,
as can be seen in the respective list.
Frequency index:
This number is the arithmetic average of the segment frequencies of a language.
A language with mostly rare segments will have a low frequency index,
whereas a language with mostly common sounds will have a high frequency index.
A frequency index of 0.1 means that a language has many very rare segments;
0.7 means it has many common segments; the average frequency index of all languages is 0.39.
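As a rough sketch, the frequency index can be computed as follows in Python. The values quoted above (0.1, 0.7, 0.39) suggest that the index is expressed as a proportion between 0 and 1 rather than in percent; that scale is an assumption here. The function name and the six-segment toy inventory are hypothetical, and the language counts are taken from the lists of most frequent consonants and vowels.

```python
from statistics import mean

UPSID_LANGUAGE_COUNT = 451  # number of languages in UPSID


def frequency_index(language_counts, total_languages=UPSID_LANGUAGE_COUNT):
    """Arithmetic mean of the segment frequencies, as a proportion (assumed scale)."""
    if not language_counts:
        raise ValueError("inventory must contain at least one segment")
    return mean(count / total_languages for count in language_counts)


# Hypothetical toy inventory: segment -> number of UPSID languages containing it.
toy_inventory = {"m": 425, "k": 403, "p": 375, "i": 393, "a": 392, "u": 369}

print(round(frequency_index(list(toy_inventory.values())), 3))
```

Because the toy inventory contains only very common segments, its index is much higher than that of any real language in the database.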
The histogram below shows the distribution of the frequency indices in UPSID.
min | 2.5% | 10% | 25% | median | mean | 75% | 90% | 97.5% | max |
.1057 | .2044 | .2663 | .3300 | .3891 | .3909 | .4520 | .5147 | .5785 | .6562 |
Note that there is a relation between frequency index and number of segments in a language.
That is, if a language has only a few segments, these are likely to be rather common across the languages in UPSID.
On the other hand, a language with many segments will also have many segments that are uncommon in the UPSID database.
This does not necessarily mean that certain sounds are more natural; rather, it is a probabilistic effect:
if you fill a pot with many red marbles, a few green marbles, and other marbles of various colors
and you draw a small random sample (e.g. 10 marbles), you will get mostly red marbles.
If you draw a large random sample (e.g. 100 marbles), you will also get many marbles of the less common colors.
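The marble picture can be tried out with a small simulation. The Python sketch below is only an illustration: the composition of the pot (50% red, 10% green, 40 rare colors at 1% each) is invented, marbles are drawn with replacement as if the pot were very large, and the "mean share" of the distinct colors drawn stands in loosely for the frequency index.

```python
import random
from statistics import mean

# Hypothetical pot: color -> share of marbles. These numbers are invented.
pot = {"red": 0.50, "green": 0.10}
pot.update({f"rare-{i}": 0.01 for i in range(40)})


def mean_share_of_colors_drawn(sample_size, rng):
    """Draw marbles, then average the pot share of each distinct color seen."""
    colors = list(pot)
    drawn = rng.choices(colors, weights=[pot[c] for c in colors], k=sample_size)
    return mean(pot[c] for c in set(drawn))


rng = random.Random(0)  # fixed seed so that the run is repeatable
small = mean(mean_share_of_colors_drawn(10, rng) for _ in range(200))
large = mean(mean_share_of_colors_drawn(100, rng) for _ in range(200))
print(f"small samples: {small:.3f}, large samples: {large:.3f}")
```

Small samples contain few distinct colors, mostly the common ones, so their mean share is high; large samples pick up many rare colors, which pulls the mean share down.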
The scatterplot below shows the relation between the frequency index and the logarithm of the number of segments in a language
(the formula of the curve is "Freq. = 1.2282298 - 0.2479315 Log(nr_seg)", with an RSquare of 0.718 for the fit).
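The fitted curve can be evaluated with a few lines of Python. The formula does not state the base of the logarithm; the sketch assumes the natural logarithm, because that maps the median inventory size of 29 segments to about 0.39, close to the median frequency index reported above, whereas a base-10 logarithm would give about 0.87. The function name is hypothetical.

```python
import math

INTERCEPT = 1.2282298  # constants of the fitted curve, taken from the text
SLOPE = -0.2479315


def predicted_frequency_index(number_of_segments):
    """Evaluate the fitted curve; math.log is the natural logarithm (assumed base)."""
    if number_of_segments <= 0:
        raise ValueError("number of segments must be positive")
    return INTERCEPT + SLOPE * math.log(number_of_segments)


# Smallest, median and largest inventory sizes in UPSID.
for n in (11, 29, 141):
    print(n, round(predicted_frequency_index(n), 3))
```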