How to visualize high-dimensional data: a roadmap
2020, J. Data Min. Digit. Humanit.
https://doi.org/10.46298/JDMDH.5594
Abstract
Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than univariate data, because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize them graphically in order to identify any latent structure. If found, such structure facilitates the formulation of hypotheses, which can be tested using a range of mathematical and statistical methods. Where the dimensionality is greater than 3, however, direct graphical investigation is impossible. This discussion presents a roadmap of how this obstacle can be overcome, in three main parts: the first presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines ...