Clemens Reimann, Peter Filzmoser, Robert Garrett, Rudolf Dutter

Statistical Data Analysis Explained
Applied Environmental Statistics with R

WILEY. To appear in March 2008.
ISBN: 978-0-470-98581-6

Below is the table of contents of the book.
The underlined headings refer to figures in the sections.
All figures can be reproduced with the listed R code (R-scripts).
The main functions are available in the R package StatDA and may be useful for analysing your own data.

For instructions on installing R and the package StatDA, see HELP.

Data sets: HERE

CONTENTS

1 Introduction
       1.1 The Kola Ecogeochemistry Project
             1.1.1 Short Description of the Kola Project Survey Area
             1.1.2 Sampling and Characteristics of the Different Sample Materials
             1.1.3 Sample Preparation and Chemical Analysis

2 Preparing the Data for Use in R and DAS+R
       2.1 Required Data Format for Import in R and DAS+R
       2.2 The Detection Limit Problem
       2.3 Missing Values
       2.4 Some "Typical" Problems Encountered When Editing a Laboratory Data Report File to a DAS+R File
             2.4.1 Sample Identification
             2.4.2 Reporting Units
             2.4.3 Variable Names
             2.4.4 Results Below the Detection Limit
             2.4.5 Handling of Missing Values
             2.4.6 File Structure
             2.4.7 Quality Control Samples
             2.4.8 Geographical Coordinates, Further Editing and Some Unpleasant Limitations of Spreadsheet Programs
       2.5 Appending and Linking Data Files
       2.6 Requirements for a Geochemical Database
       2.7 Summary

3 Graphics to Display the Data Distribution
       3.1 The One-Dimensional Scatterplot
       3.2 The Histogram
       3.3 The Density Trace
       3.4 Plots of the Distribution Function
             3.4.1 Plot of the Cumulative Distribution Function (CDF-plot)
             3.4.2 Plot of the Empirical Cumulative Distribution Function (ECDF-plot)
             3.4.3 The Quantile-Quantile Plot (QQ-plot)
             3.4.4 The Cumulative Probability Plot (CP-plot)
             3.4.5 The Probability-Probability Plot (PP-plot)
             3.4.6 Discussion of the Distribution Function Plots
       3.5 Boxplots
             3.5.1 The Tukey Boxplot
             3.5.2 The log-Boxplot
             3.5.3 The Percentile Based Boxplot and the Box-and-Whisker Plot
             3.5.4 The Notched Boxplot
       3.6 Combination of Histogram, Density Trace, One-Dimensional Scatterplot, Boxplot, and ECDF-plot
       3.7 Combination of Histogram, Boxplot or Box-and-Whisker Plot, ECDF-plot, and CP-plot
       3.8 Summary

4 Statistical Distribution Measures
       4.1 Central Value
             4.1.1 The Arithmetic Mean
             4.1.2 The Geometric Mean
             4.1.3 The Mode
             4.1.4 The Median
             4.1.5 Trimmed Mean and Other Robust Measures of the Central Value
             4.1.6 Influence of the Shape of the Data Distribution
       4.2 Measures of Spread
             4.2.1 The Range
             4.2.2 The Interquartile Range (IQR)
             4.2.3 The Standard Deviation
             4.2.4 The Median Absolute Deviation (MAD)
             4.2.5 Variance
             4.2.6 The Coefficient of Variation (CV)
             4.2.7 The Robust Coefficient of Variation (CVR)
       4.3 Quartiles, Quantiles and Percentiles
       4.4 Skewness
       4.5 Kurtosis
       4.6 Summary Table of Statistical Distribution Measures
       4.7 Summary

5 Mapping Spatial Data
       5.1 Map Coordinate Systems (Map Projection)
       5.2 Map Scale
       5.3 Choice of the Base Map for Geochemical Mapping
       5.4 Mapping Geochemical Data With Proportional Dots
       5.5 Mapping Geochemical Data Using Classes
             5.5.1 Choice of Symbols for Geochemical Mapping
             5.5.2 Percentile Classes
             5.5.3 Boxplot Classes
             5.5.4 Use of ECDF- and CP-plot to Select Classes for Mapping
       5.6 Surface Maps Constructed With Smoothing Techniques
       5.7 Surface Maps Constructed With Kriging
             5.7.1 Construction of the (Semi)Variogram
             5.7.2 Quality Criteria for Semivariograms
             5.7.3 Mapping Based on the Semivariogram (Kriging)
             5.7.4 Possible Problems With Semivariogram Estimation and Kriging
       5.8 Colour Maps
       5.9 Some Common Mistakes in Geochemical Mapping
             5.9.1 Incorrect Map Scale
             5.9.2 Incorrect Base Map
             5.9.3 Incorrect Symbol Set
             5.9.4 Incorrect Scaling of Symbol Size
             5.9.5 Incorrectly (e.g., Arbitrarily) Chosen Classes
       5.10 Summary

6 Further Graphics for Exploratory Data Analysis
       6.1 Scatterplots (xy-plots)
             6.1.1 Scatterplots with User Defined Lines or Fields
       6.2 Linear Regression Lines
       6.3 Time Trends
       6.4 Spatial Trends
       6.5 Spatial Distance Plot
       6.6 Spiderplots (Normalised Multi-Element Diagrams)
       6.7 Scatterplot Matrix
       6.8 Ternary Plots
       6.9 Summary

7 Defining Background and Threshold, Identification of Data Outliers and Element Sources
       7.1 Statistical Methods to Identify Extreme Values and Data Outliers
             7.1.1 Classical Statistics
             7.1.2 The Boxplot
             7.1.3 Robust Statistics
             7.1.4 Percentiles
             7.1.5 Can the Range of Background be Calculated?
       7.2 Detecting Outliers and Extreme Values in the ECDF- or CP-Plot
       7.3 Including the Spatial Distribution in the Definition of Background
             7.3.1 Using Geochemical Maps to Identify a Reasonable Threshold
             7.3.2 The Concentration-Area Plot
             7.3.3 Spatial Trend Analysis
             7.3.4 Multiple Background Populations in One Data Set
       7.4 Methods to Distinguish Geogenic from Anthropogenic Element Sources
             7.4.1 The TOP/BOT-Ratio
             7.4.2 Enrichment Factors (EFs)
             7.4.3 Mineralogical Versus Chemical Methods
       7.5 Summary

8 Comparing Data in Tables and Graphics
       8.1 Comparing Data in Tables
       8.2 Graphical Comparison of the Data Distributions of Several Data Sets
       8.3 Comparing the Spatial Data Structure
       8.4 Subset Creation -- a Mighty Tool in Graphical Data Analysis
       8.5 Data Subsets in Scatterplots
       8.6 Data Subsets in Time and Spatial Trend Diagrams
       8.7 Data Subsets in Ternary Plots
       8.8 Data Subsets in the Scatterplot Matrix
       8.9 Data Subsets in Maps
       8.10 Summary

9 Comparing Data Using Statistical Tests
       9.1 Tests for Distribution (Kolmogorov-Smirnov and Shapiro-Wilk Tests)
             9.1.1 The Kola Data Set and the Normal or Lognormal Distribution
       9.2 The One-Sample t-Test (Test for the Central Value)
       9.3 Wilcoxon Signed-rank Test
       9.4 Comparing Two Central Values of the Distributions of Independent Data Groups
             9.4.1 The Two-sample t-test
             9.4.2 The Wilcoxon Rank Sum Test
       9.5 Comparing Two Central Values of Matched Pairs of Data
             9.5.1 The Paired t-test
             9.5.2 The Wilcoxon Test
       9.6 Comparing the Variance of Two Data Sets
             9.6.1 The F-test
             9.6.2 The Ansari-Bradley Test
       9.7 Comparing Several Central Values
             9.7.1 One-way Analysis of Variance (ANOVA)
             9.7.2 Kruskal-Wallis Test
       9.8 Comparing the Variance of Several Data Groups
             9.8.1 Bartlett Test
             9.8.2 Levene Test
             9.8.3 Fligner Test
       9.9 Comparing Several Central Values of Dependent Groups
             9.9.1 ANOVA with Blocking (Two-way)
             9.9.2 Friedman Test
       9.10 Summary

10 Improving Data Behaviour for Statistical Analysis: Ranking and Transformations
       10.1 Ranking/Sorting
       10.2 Non-linear Transformations
             10.2.1 Square Root Transformation
             10.2.2 Power Transformation
             10.2.3 Log(arithmic) Transformation
             10.2.4 Box-Cox Transformation
             10.2.5 Logit Transformation
       10.3 Linear Transformations
             10.3.1 Addition/Subtraction
             10.3.2 Multiplication/Division
             10.3.3 Range Transformation
       10.4 Preparing a Data Set for Multivariate Data Analysis
             10.4.1 Centring
             10.4.2 Scaling
       10.5 The Special Case of Closed Number Systems
             10.5.1 Additive Logratio Transformation
             10.5.2 Centred Logratio Transformation
             10.5.3 Isometric Logratio Transformation
       10.6 Summary

11 Correlation
       11.1 Pearson Correlation
       11.2 Spearman Rank Correlation
       11.3 Kendall-tau Correlation
       11.4 Robust Correlation Coefficients
       11.5 When is a Correlation Coefficient Significant?
       11.6 Working With Many Variables
       11.7 Correlation Analysis and Inhomogeneous Data
       11.8 Correlation Results Following Additive Logratio or Centred Logratio Transformations
       11.9 Summary

12 Multivariate Graphics
       12.1 Profiles
       12.2 Stars
       12.3 Segments
       12.4 Boxes
       12.5 Castles and Trees
       12.6 Parallel Coordinates Plot
       12.7 Summary

13 Multivariate Outlier Detection
       13.1 Univariate Versus Multivariate Outlier Detection
       13.2 Robust Versus Non-robust Outlier Detection
       13.3 The Chi-square Plot
       13.4 Automated Multivariate Outlier Detection and Visualisation
       13.5 Other Graphical Approaches for Identifying Outliers and Groups
       13.6 Summary

14 Principal Component Analysis (PCA) and Factor Analysis (FA)
       14.1 Conditioning the Data for PCA and FA
             14.1.1 Different Data Ranges and Variability, Skewness
             14.1.2 Normal Distribution
             14.1.3 Data Outliers
             14.1.4 Closed Data
             14.1.5 Censored Data
             14.1.6 Inhomogeneous Data Sets
             14.1.7 Spatial Dependence
             14.1.8 Dimensionality
       14.2 Principal Component Analysis (PCA)
             14.2.1 The Scree Plot
             14.2.2 The Biplot
             14.2.3 Mapping the Principal Components
             14.2.4 Robust Versus Classical PCA
       14.3 Factor Analysis
             14.3.1 Choice of Factor Analysis Method
             14.3.2 Choice of Rotation Method
             14.3.3 Number of Factors Extracted
             14.3.4 Selection of Elements for Factor Analysis
             14.3.5 Graphical Representation of the Results of Factor Analysis
             14.3.6 Robust Versus Classical Factor Analysis
       14.4 Summary

15 Cluster Analysis
       15.1 Possible Data Problems in the Context of Cluster Analysis
             15.1.1 Mixing Major, Minor and Trace Elements
             15.1.2 Data Outliers
             15.1.3 Censored Data
             15.1.4 Data Transformation and Standardisation
             15.1.5 Closed Data
       15.2 Distance Measures
       15.3 Clustering Samples
             15.3.1 Hierarchical Methods
             15.3.2 Partitioning Methods
             15.3.3 Model-based Methods
             15.3.4 Fuzzy Methods
       15.4 Clustering Variables
       15.5 Evaluation of Cluster Validity
       15.6 Selection of Variables for Cluster Analysis
       15.7 Summary

16 Regression Analysis (RA)
       16.1 Data Requirements for Regression Analysis
             16.1.1 Homogeneity of Variance and Normality
             16.1.2 Data Outliers, Extreme Values
             16.1.3 Other Considerations
       16.2 Multiple Regression
       16.3 Classical Least Squares (LS) Regression
             16.3.1 Fitting a Regression Model
             16.3.2 Inferences from the Regression Model
             16.3.3 Regression Diagnostics
             16.3.4 Regression with Opened Data
       16.4 Robust Regression
             16.4.1 Fitting a Robust Regression Model
             16.4.2 Robust Regression Diagnostics
       16.5 Model Selection in Regression Analysis
       16.6 Other Regression Methods
       16.7 Summary

17 Discriminant Analysis (DA) and Other Knowledge Based Classification Methods
       17.1 Methods for Discriminant Analysis
       17.2 Data Requirements for Discriminant Analysis
       17.3 Visualisation of the Discriminant Function
       17.4 Prediction With Discriminant Analysis
       17.5 Exploring for Similar Data Structures
       17.6 Other Knowledge Based Classification Methods
             17.6.1 Allocation
             17.6.2 Weighted Sums
       17.7 Summary

18 Quality Control (QC)
       18.1 Randomised Samples
       18.2 Trueness
       18.3 Accuracy
       18.4 Precision
             18.4.1 Analytical Duplicates
             18.4.2 Field Duplicates
       18.5 Analysis of Variance (ANOVA)
       18.6 Using Maps to Assess Data Quality
       18.7 Variables Analysed by Two Different Analytical Techniques
       18.8 Working With Censored Data -- A Practical Example
       18.9 Summary

19 Introduction to R and Structure of the DAS+R Graphical User Interface
       19.1 R
             19.1.1 Installing R
             19.1.2 Getting Started
             19.1.3 Loading Data
             19.1.4 Generating and Saving Plots in R
             19.1.5 Scatterplots
       19.2 R-scripts
       19.3 A Brief Overview of Relevant R Commands
       19.4 DAS+R
             19.4.1 Loading Data into DAS+R
             19.4.2 Plotting Diagrams
             19.4.3 Tables
             19.4.4 Working with "Worksheets"
             19.4.5 Groups and Subsets
             19.4.6 Mapping
       19.5 Summary

Bibliography

Index