The smacof package is the most
flexible package for multidimensional scaling (MDS) computation in R.
This vignette gives a very basic introduction to MDS using smacof. More
technical details and advanced examples can be found in the main package
vignette (vignette("smacof")
). An accessible introductory
MDS book written for applied researchers is Borg,
Groenen, and Mair (2018).
MDS is a method to represent associations among objects (variables, individuals, etc.) and requires a proximity matrix as input. We distinguish between two types of input data: directly observed proximities and derived proximities. In the first scenario the proximities are directly observed/collected in an experiment (e.g., participants rate pairwise similarities between stimuli). A classical example in the MDS literature is the Ekman dataset that comes with the package:
library(smacof)
#> Loading required package: plotrix
#> Loading required package: colorspace
#> Loading required package: e1071
#>
#> Attaching package: 'smacof'
#> The following object is masked from 'package:base':
#>
#> transform
ekman
#> 434 445 465 472 490 504 537 555 584 600 610 628 651
#> 445 0.86
#> 465 0.42 0.50
#> 472 0.42 0.44 0.81
#> 490 0.18 0.22 0.47 0.54
#> 504 0.06 0.09 0.17 0.25 0.61
#> 537 0.07 0.07 0.10 0.10 0.31 0.62
#> 555 0.04 0.07 0.08 0.09 0.26 0.45 0.73
#> 584 0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33
#> 600 0.07 0.04 0.01 0.01 0.02 0.08 0.14 0.19 0.58
#> 610 0.09 0.07 0.02 0.00 0.02 0.02 0.05 0.04 0.37 0.74
#> 628 0.12 0.11 0.01 0.01 0.01 0.02 0.02 0.03 0.27 0.50 0.76
#> 651 0.13 0.13 0.05 0.02 0.02 0.02 0.02 0.02 0.20 0.41 0.62 0.85
#> 674 0.16 0.14 0.03 0.04 0.00 0.01 0.00 0.02 0.23 0.28 0.55 0.68 0.76
The elements of this matrix are similarities (confusion proportions)
of colors with wavelengths from 434 nm to 674 nm. Important: smacof
requires dissimilarities as inputs! Therefore, we need to convert these
similarities into dissimilarities. The sim2diss()
function
supports the user with this conversion. Here we can simply subtract the
proportions from 1 (method = "confusion"
).
In modern MDS applications, the more common input data scenario are
derived proximities. The starting point is a standard data frame
structure. As an example we use a dataset on PTSD symptoms from the
MPsychoR
package.
library(MPsychoR)
data(Wenchuan)
head(Wenchuan)
#> intrusion dreams flash upset physior avoidth avoidact amnesia lossint distant
#> 1 2 2 2 2 3 2 3 2 3 2
#> 2 2 2 2 3 3 3 3 2 3 3
#> 3 2 4 4 4 3 3 3 5 4 3
#> 4 2 1 2 2 1 1 2 2 2 1
#> 5 2 2 2 2 2 2 2 2 3 2
#> 6 4 3 2 2 2 2 3 3 2 2
#> numb future sleep anger concen hyper startle
#> 1 2 1 3 4 3 4 2
#> 2 2 2 3 3 2 3 3
#> 3 2 3 4 4 4 3 4
#> 4 1 2 2 1 2 3 3
#> 5 2 2 3 2 3 2 3
#> 6 2 3 2 3 2 3 2
In total we have 17
PTSD symptoms collected on
362
individuals. We need to convert this matrix into
dissimilarities. If we want to scale the symptoms, a popular strategy is
to compute the correlation matrix (similarities) and subsequently
convert it into a dissimilarity matrix.
Rmat <- cor(Wenchuan, use = "pairwise.complete.obs")
Delta <- sim2diss(Rmat, method = "corr", to.dist = TRUE)
round(Delta, 2)
#> intrusion dreams flash upset physior avoidth avoidact amnesia lossint
#> dreams 0.53
#> flash 0.59 0.56
#> upset 0.65 0.65 0.67
#> physior 0.64 0.65 0.64 0.54
#> avoidth 0.75 0.73 0.71 0.67 0.71
#> avoidact 0.76 0.74 0.73 0.68 0.70 0.45
#> amnesia 0.77 0.76 0.73 0.74 0.73 0.73 0.72
#> lossint 0.77 0.78 0.77 0.77 0.75 0.70 0.69 0.74
#> distant 0.79 0.79 0.80 0.79 0.74 0.78 0.73 0.76 0.64
#> numb 0.81 0.84 0.80 0.83 0.77 0.82 0.81 0.75 0.73
#> future 0.72 0.79 0.75 0.73 0.74 0.81 0.79 0.73 0.77
#> sleep 0.69 0.69 0.71 0.70 0.72 0.70 0.73 0.79 0.73
#> anger 0.74 0.77 0.78 0.74 0.75 0.78 0.79 0.78 0.72
#> concen 0.76 0.74 0.72 0.75 0.74 0.77 0.78 0.75 0.66
#> hyper 0.69 0.68 0.67 0.70 0.66 0.74 0.72 0.75 0.71
#> startle 0.75 0.75 0.70 0.69 0.71 0.75 0.77 0.78 0.74
#> distant numb future sleep anger concen hyper
#> dreams
#> flash
#> upset
#> physior
#> avoidth
#> avoidact
#> amnesia
#> lossint
#> distant
#> numb 0.67
#> future 0.73 0.67
#> sleep 0.78 0.79 0.71
#> anger 0.76 0.74 0.71 0.63
#> concen 0.73 0.72 0.67 0.63 0.57
#> hyper 0.75 0.70 0.65 0.64 0.64 0.58
#> startle 0.80 0.74 0.72 0.63 0.67 0.59 0.53
At this point we are ready to fit an MDS.
What MDS is trying to achieve is to represent these dissimilarities as distances in a low-dimensional space. Most often researchers use two or three dimensions in order to be able plot the resulting configuration. The simplest call to perform a 2D MDS is the following.
mds_wen1 <- mds(Delta)
mds_wen1
#>
#> Call:
#> mds(delta = Delta)
#>
#> Model: Symmetric SMACOF
#> Number of objects: 17
#> Stress-1 value: 0.309
#> Number of iterations: 59
The print output shows a measure called “stress” which tells us how
well the solution fits the data. It becomes 0 in case of a perfect fit.
A value of 0.309
is not great for this simple MDS problem.
Note that in the MDS literature there exist stress rules of thumb. We
strongly advice to use several diagnostics in combination (see Mair, Borg, and Rusch 2016) in order to
assess the goodness-of-fit, instead of relying on these rules of
thumb.
This said, we need to improve this solution. We could increase the
number of dimensions but it will be difficult to plot the solution.
Rather, we can choose a more flexible transformation function. Modern
MDS implementations provide the option to transform the input
dissimilarities. This is a big advantage over classical scaling
implementations as in cmdscale()
. The default used in
mds()
is a ratio transformation which essentially says that
input dissimilarities are not touched. We could relax this restriction
by allowing something more flexible such as an ordinal transformation
which results in a step function. Shepard plots give insights into these
transformations.
mds_wen2 <- mds(Delta, type = "ordinal")
plot(mds_wen1, plot.type = "Shepard", main = "Shepard Diagram (Ratio Transformation)")
How should we decide on a transformation function? One option is to
decide on a data driven basis. Try out different transformation
functions ("interval"
is another popular one), look at the
Shepard plots, and pick one that leads to an acceptably low stress value
while keeping the transformation as simple as possible (an ordinal
transformation always gives the lowest stress as it is the most flexible
transformation). Another option is to decide on the basis of ad hoc
scale levels or substantive considerations. For instance, in this
example we could argue that the original data are measured on a Likert
scale and we therefore process the dissimilarities using an ordinal
transformation. More thorough discussions on such “optimal scaling”
procedures can be found in Mair
(2018).
Let us look at the output of the ordinal MDS fit:
mds_wen2
#>
#> Call:
#> mds(delta = Delta, type = "ordinal")
#>
#> Model: Symmetric SMACOF
#> Number of objects: 17
#> Stress-1 value: 0.145
#> Number of iterations: 22
We see the drastic reduction in the stress value compared to the ratio fit. Assume we are happy with this solution we are now ready to produce the most important output in MDS: the configuration plot.
A nice feature of MDS is that this configuration plot is super
intuitive to interpret as the location of the points matters
substantively (as opposed to correlation networks) and the distances
between any two points are Euclidean. The closer two symptoms in the
configuration, the stronger their association.
The interpretation of the dimensions is not that crucial as, for
instance, in PCA. Sometimes researchers find an interpretation, whereas
sometimes, as in our example, there is no straightforward
interpretation. Note that this configuration can be rotated arbitrarily
if this helps for interpretation. Further details on interpreting
configuration plots can be found in Borg,
Groenen, and Mair (2018), Mair
(2018), and the main package vignette.
As a final note, for users more attracted to ggplot
style, here is some basic code:
library(ggplot2)
conf_wen <- as.data.frame(mds_wen2$conf)
p <- ggplot(conf_wen, aes(x = D1, y = D2, label = rownames(conf_wen)))
p + geom_point(size = 0.9) + coord_fixed() + geom_text(size = 3.5, vjust = -0.8) +
ggtitle("PTSD Symptoms Configuration")
It is important to fix the aspect ratio to 1 in order to interpret
the distances in a Euclidean way. To avoid overlapping labels, the
ggrepel
package provides some nice options.