From Practical Statistics for Educators
Latest revision as of 15:29, 19 November 2019

Data Screening

Once data from a research study is gathered and has been entered into SPSS, researchers must examine their data to be sure they can validly interpret their results. Valid interpretation of data is reliant on two data features:

1. The data must meet the assumptions of the analysis procedure.

2. The data in the data file are "an accurate representation or transcription of what was provided by research participants as their original responses or what was provided by archival sources as original data" (Meyers, Gamst, & Guarino, 2017, p. 31).


contributed by Britany Kuslis, WCSU Cohort 8


Reference:

Meyers, L. S., Gamst, G., & Guarino, A. J. (2017). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications.


Data Cleaning

Data cleaning is the process of making sure the data file accurately represents the original data. In its initial phase, the researcher looks for incomplete data that may skew further data screening. For example, if a five-question subscale is scored by adding the item scores and deriving a mean, a respondent who skips a question would skew the subscale score, and that respondent's set of answers should be removed.
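
The subscale example above can be sketched in a few lines of Python. This is only an illustration, not a procedure from the text: the respondent IDs, the scores, and the use of None to mark a skipped item are all assumptions made for the example.

```python
# Hypothetical responses to a five-item subscale; None marks a skipped item.
responses = {
    "r01": [4, 5, 3, 4, 4],
    "r02": [2, None, 3, 2, 1],
    "r03": [5, 5, 4, 5, 5],
}

# Keep only the respondents who answered every item on the subscale.
complete = {
    rid: items
    for rid, items in responses.items()
    if all(score is not None for score in items)
}

# Respondents whose set of answers is dropped because an item is missing.
removed = sorted(set(responses) - set(complete))

# Subscale mean for each retained respondent.
subscale_means = {rid: sum(items) / len(items) for rid, items in complete.items()}

print("removed:", removed)
print("subscale means:", subscale_means)
```

Here respondent r02 skipped the second item, so that set of answers is removed before any subscale mean is computed.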

contributed by Mykal Kuslis, WCSU Cohort 8


Reference:

Meyers, L. S., Gamst, G., & Guarino, A. J. (2017). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications. (pp. 31-33)

Value Cleaning

Value cleaning is ensuring that the values in the data file are "within the limits of reasonable expectation", that is, "to the extent that it is possible...within the bounds of feasibility" (Meyers, Gamst, & Guarino, 2017, p. 32). For example, you want to ensure that the age of a presumed adult is not 9 years old, or that a response to an item rated on a Likert scale of 1-5 is not a 6 or any other value outside the bounds of the study.
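
A range check of this kind might look like the following Python sketch. The variable names, the records, and the limits (adults taken to be 18 to 120 years old, items scored 1 to 5) are assumptions chosen for illustration; the limits in a real study come from its own design.

```python
# Hypothetical valid ranges: (lowest allowed value, highest allowed value).
VALID_RANGES = {"age": (18, 120), "item1": (1, 5)}

# Hypothetical cases as they might appear in a data file.
records = [
    {"id": 1, "age": 34, "item1": 4},
    {"id": 2, "age": 9, "item1": 3},   # age too low for a presumed adult
    {"id": 3, "age": 27, "item1": 6},  # 6 is not on a 1-5 Likert scale
]


def out_of_range(record, valid_ranges):
    """Return the names of variables whose value falls outside its valid range."""
    return [
        name
        for name, (low, high) in valid_ranges.items()
        if not low <= record[name] <= high
    ]


# Map each problem case to the variables that need to be checked.
flagged = {}
for record in records:
    problems = out_of_range(record, VALID_RANGES)
    if problems:
        flagged[record["id"]] = problems

print(flagged)
```

Flagged values are candidates for checking against the original responses; the sketch only locates them and does not decide how they should be corrected.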

contributed by Britany Kuslis, WCSU Cohort 8

Reference: Meyers, L. S., Gamst, G., & Guarino, A. J. (2017). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications.

Outliers

Outliers are values that are "extreme or unusual values on a single variable (univariate) or on a combination of variables (multivariate)" (Meyers, Gamst, & Guarino, 2017, p. 48).


The presence of outliers can greatly impact the results of an analysis for two major reasons:

(1) The mean of the variable might no longer be a good representation of the variable's typical value, and

(2) An outlier yields a deviation from the mean that, when squared, produces a value large enough to skew any computation based on squared deviations.
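
Both effects can be seen in a small numeric sketch. The scores below are made up for illustration, and the single extreme value of 60 is an assumption chosen to make the arithmetic easy to follow.

```python
from statistics import mean

# Hypothetical scores, without and with one extreme value.
scores = [10, 11, 12, 13, 14]
with_outlier = scores + [60]


def squared_deviations(values):
    """Squared difference between each value and the mean of the values."""
    m = mean(values)
    return [(v - m) ** 2 for v in values]


sq_dev = squared_deviations(with_outlier)

# Share of the total sum of squares contributed by the outlier alone.
outlier_share = sq_dev[-1] / sum(sq_dev)

print("mean without outlier:", mean(scores))        # 12
print("mean with outlier:", mean(with_outlier))     # 20
print("squared deviation of the outlier:", sq_dev[-1])  # 1600
print("outlier share of sum of squares:", round(outlier_share, 2))  # 0.83
```

The mean moves from 12 to 20, a value larger than five of the six scores, and the one outlier accounts for about 83% of the sum of squared deviations.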


Outliers may signal "anomalies within the data" that will likely need to be addressed prior to moving forward with the statistical analysis (Meyers, Gamst, & Guarino, 2017, p. 48).


contributed by Britany Kuslis, WCSU Cohort 8


Reference:

Meyers, L. S., Gamst, G., & Guarino, A. J. (2017). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications.

Causes of Outliers

- Data entry errors or improper attribute coding (normally caught in data cleaning).

- Extraordinary events or unusual circumstances (for example, a traumatic event causing someone to forget what they learned, or a person remembering all 80 facts).

- Some outliers have no explanation; these are good candidates for deletion.

- Multivariate outliers: an unusual pattern or combination of values on several variables.

contributed by Mykal Kuslis, WCSU Cohort 8

Reference:

Meyers, L. S., Gamst, G., & Guarino, A. J. (2017). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications. (pp. 48-49)

Detection of Multivariate Outliers: Scatterplot Matrices

The uniqueness of multivariate outliers lies in their pattern or combination of values on several variables. For example, a particular combination of age, sex, and number of arrests may be quite different from other combinations (young males in certain populations will have proportionally more arrests than other combinations of sex and age).

To look for such cases, run bivariate scatterplots for combinations of key variables.

Run a scatterplot matrix.

Each case is represented as a point on the X and Y axes.

Most cases will fall within the elliptical swarm or pattern mass; outliers are those cases that tend to lie outside the oval.

See page 52 of the reference for example matrices.
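
Outside SPSS, a scatterplot matrix can be sketched in Python with pandas and matplotlib. This is an illustration rather than the procedure used in the reference: the variable names and values are hypothetical, and the last case is deliberately given a young age combined with a high arrest count so that it falls outside the main swarm of points.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; remove this line for interactive use

import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical data: one row per case, one column per key variable.
data = pd.DataFrame(
    {
        "age": [19, 23, 31, 35, 42, 50, 58, 20],
        "years_education": [12, 14, 16, 12, 18, 16, 12, 11],
        "arrests": [1, 1, 0, 1, 0, 0, 0, 9],
    }
)

# One bivariate scatterplot per pair of variables, histograms on the diagonal.
axes = scatter_matrix(data, figsize=(6, 6), diagonal="hist")

# The figure can be saved with fig.savefig(...) or shown interactively.
fig = axes[0, 0].figure
```

Each off-diagonal panel is one of the bivariate scatterplots described above, so cases lying outside the elliptical swarm can be spotted panel by panel.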

contributed by Mykal Kuslis, WCSU Cohort 8

Reference:

Meyers, L. S., Gamst, G., & Guarino, A. J. (2017). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications. (pp. 49-53)

Detection of Multivariate Outliers: Mahalanobis Distance

The Mahalanobis distance statistic measures "the multivariate 'distance' between each case and the group multivariate mean (known as centroid) taking into account the correlations between the variables" (Meyers, Gamst, & Guarino, 2017, p. 52). This method is used to determine whether there are scores that vary from the mean of a set of DVs. The Mahalanobis distance details how far a case is from the group center mass of the predictors or IVs. The greater the distance, the higher the possibility of a multivariate outlier. According to Lawrence S. Meyers, Glenn Gamst, and A. J. Guarino, "Each case is evaluated using the chi square distribution with a stringent alpha level of .001. Cases that reach this significance threshold can be considered multivariate outliers and possible candidates for elimination" (Meyers, Gamst, & Guarino, 2017, p. 53). The authors add that this approach is not without its critics and point to Wilcox (2012) for alternative approaches to multivariate outlier detection.
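
The calculation can be sketched for the simplest case of two variables using only the Python standard library. Several things here are assumptions made for the illustration and do not come from the reference: the data are invented, the covariance uses the sample (n - 1) denominator, and the degrees of freedom are taken to equal the number of variables (2), for which the chi-square cutoff has the closed form -2 ln(alpha). With more variables, the cutoff would come from a chi-square table or a statistics package.

```python
import math
from statistics import mean

# Hypothetical data: 24 cases where y tracks x closely, plus a final case
# (x=5, y=20) whose values are unremarkable alone but unusual in combination.
x = [float(i) for i in range(1, 25)] + [5.0]
y = [i + (1.0 if i % 2 else -1.0) for i in range(1, 25)] + [20.0]

n = len(x)
mx, my = mean(x), mean(y)  # the centroid (multivariate mean)

# Sample variances and covariance (n - 1 denominator).
sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
syy = sum((b - my) ** 2 for b in y) / (n - 1)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix


def mahalanobis_sq(a, b):
    """Squared Mahalanobis distance of the point (a, b) from the centroid."""
    dx, dy = a - mx, b - my
    return (dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det


d2 = [mahalanobis_sq(a, b) for a, b in zip(x, y)]

# Chi-square critical value at alpha = .001 with 2 degrees of freedom.
# With 2 df the chi-square CDF is 1 - exp(-x / 2), so the cutoff is -2 * ln(alpha).
critical = -2 * math.log(0.001)  # about 13.82

# Case numbers (1-based) whose squared distance exceeds the cutoff.
outliers = [case for case, d in enumerate(d2, start=1) if d > critical]
print(outliers)
```

Only case 25 exceeds the cutoff, with a squared distance of about 20.74. Its x and y values both sit inside the range of the other cases, so it would not be flagged as a univariate outlier on either variable; it stands out only because the combination breaks the strong correlation between the two.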


Identifying Multivariate Outliers with Mahalanobis Distance (Grande, 2016): www.youtube.com/watch?v=AXLAX6r5JgE

Mahalanobis Distance (Clapham, 2016): www.youtube.com/watch?v=spNpfmWZBmg


contributed by Britany Kuslis, WCSU Cohort 8


References:

Clapham, M. E. (2016). Mahalanobis distance [Video]. YouTube. www.youtube.com/watch?v=spNpfmWZBmg

Grande, T. (2016). Identifying multivariate outliers with Mahalanobis distance [Video]. YouTube. www.youtube.com/watch?v=AXLAX6r5JgE

Meyers, L. S., Gamst, G., & Guarino, A. J. (2017). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications.