The Data Workshop¹ is a web-based data exploration environment intended to help users understand elementary experiment design (data definition, collection, and organization) and subsequent analysis (data representation and statistical examination).² In addition to entering their own raw data, users can generate data by creating and carrying out surveys in the Data Workshop Survey Room or by designing simulations, such as coin flips and dice tosses, with customizable setups. All generated results can be analyzed in the Data Workshop Workbench, and both raw data and results can be shared with other Workshop users.

We highlight experiment design separately from analysis because of a growing trend in which students are given data sets collected by others and spend their time learning to find statistics that describe features such as central tendency or spread and to make graphs. The trend is to some extent understandable, for both practical and professional reasons.

Practically, entering data into software and checking that it is entered properly is time-consuming, and class time is in short supply, especially when data analysis is integrated into the school curriculum, as in recent NSF-funded curriculum projects such as UCSMP (2015) and IMP (2012), or mandated by recent standards such as CCSSM (2015). Prepackaged data sets allow class time to be spent on the skills needed to find statistics and make graphs or other displays rather than on entering data into software packages. Our concern is that, at least for beginning students, skipping the step of defining variables and deciding how they are to be measured or classified leaves them unprepared to use data creatively in new situations.

Professionally, it can be argued that many expert statisticians spend their time analyzing someone else's data rather than collecting their own. Yet part of that expert analysis is necessarily spent understanding the variables and measures in a data set and determining whether they are appropriate, accurate, valid, and reliable. Having beginning students focus on these defining attributes of data is essential to their understanding of conclusions drawn from data, whether the data is their own or comes from other sources, and to their future data analysis needs, in school or out.

Experiment design

Data is entered into most elementary data analysis software via a spreadsheet, modified spreadsheet, or simple list (see Data Desk [2015], Fathom [2015], TinkerPlots [2015], TI [2015]). This allows a user to easily type a value into a cell, but the variable's type is not indicated. Numbers may therefore be interpreted as exact or rounded to an arbitrary decimal place, and all non-numeric values, including categorical data, are simply strings of characters. As a result, finding errors is left almost entirely to the user, and errors that are overlooked are treated as actual data. For example, typing "blue", "red", "blu", and "rd" yields, in analysis or graphs, a category variable with four values.
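
The effect of untyped entry can be illustrated with a minimal Python sketch. It is not taken from any of the packages cited above; the names `ALLOWED_COLORS` and `split_valid_invalid` are hypothetical and stand in for whatever mechanism a typed data table might use.

```python
from collections import Counter

# Values as a user might type them into untyped spreadsheet cells.
typed_values = ["blue", "red", "blu", "rd"]

# Untyped entry: every distinct string silently becomes its own category.
free_text_counts = Counter(typed_values)

# Hypothetical typed entry: only declared categories are accepted.
ALLOWED_COLORS = ("blue", "red")


def split_valid_invalid(values, allowed=ALLOWED_COLORS):
    """Separate values that match a declared category from those that do not."""
    valid, rejected = [], []
    for value in values:
        (valid if value in allowed else rejected).append(value)
    return valid, rejected


valid, rejected = split_valid_invalid(typed_values)
print(len(free_text_counts), "categories when typed freely")
print("rejected by a declared category list:", rejected)
```

With a declared list, the two typos surface as rejected entries at input time instead of appearing later as extra bars in a graph.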

A more problematic result of freehand entry into table cells is that the precision of measurements is lost. As with calculators, typical data analysis applications accept any number as exact and display it at some default precision; for example, an entry of 2.2 might be displayed as 2.2000000. Although using the default (or full floating-point) precision may be fine for calculations, the displayed value is misleading to the extent that it implies the measurement is much more precise than it is.
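
The difference between a default display and a display that respects the entered precision can be sketched as follows. This is an illustration only: the seven-place default mirrors the 2.2000000 example above, and the helper names `decimal_places` and `display` are assumptions, not part of any cited application. The sketch assumes the entry is finite numeric text.

```python
from decimal import Decimal


def decimal_places(entered: str) -> int:
    """Count the decimal places in a value exactly as it was typed."""
    exponent = Decimal(entered).as_tuple().exponent
    return max(0, -exponent)


def display(entered: str) -> str:
    """Show a value at the precision it was entered with, and no more."""
    return f"{Decimal(entered):.{decimal_places(entered)}f}"


entered = "2.2"
default_display = f"{float(entered):.7f}"  # a fixed seven-place default
typed_display = display(entered)           # keeps the one entered place

print(default_display)
print(typed_display)
```

Keeping the entry as a `Decimal` built from the typed text, rather than as a binary float, is what lets the number of entered places be recovered later.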

The Workbench's primary approach to helping users understand the variables they are studying in an experiment is to require identification of each variable's type and, if necessary, a clear definition of how it is to be measured. Variable types are:

  • Numerical: Three classes: any unit, percents, or currency. The first two classes are recorded to a defined decimal precision and the last to hundredths.
  • Category: Discrete data with up to 12 categories.
  • Formula: Derivations using a standard set of operations, functions, and previously defined numerical variables.
  • Generated: Iteratively defined sequences or random number sets.
  • Text: Character strings primarily for labeling data table rows.
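
One way to picture how such type definitions constrain a variable before any data is entered is the sketch below. The class and field names are hypothetical and do not describe the Workbench's internals; only the three numerical classes, the hundredths rule for currency, and the 12-category limit come from the list above.

```python
from dataclasses import dataclass
from enum import Enum

MAX_CATEGORIES = 12  # limit stated for Category variables


class NumericClass(Enum):
    ANY_UNIT = "any unit"
    PERCENT = "percent"
    CURRENCY = "currency"


@dataclass(frozen=True)
class NumericalVariable:
    name: str
    numeric_class: NumericClass
    decimal_places: int = 0

    def __post_init__(self):
        if self.decimal_places < 0:
            raise ValueError("decimal_places must be zero or more")
        # Currency is always recorded to hundredths.
        if self.numeric_class is NumericClass.CURRENCY and self.decimal_places != 2:
            raise ValueError("currency variables use exactly 2 decimal places")


@dataclass(frozen=True)
class CategoryVariable:
    name: str
    categories: tuple

    def __post_init__(self):
        if not 1 <= len(self.categories) <= MAX_CATEGORIES:
            raise ValueError(f"a category variable needs 1 to {MAX_CATEGORIES} categories")
        if len(set(self.categories)) != len(self.categories):
            raise ValueError("category labels must be distinct")


attendance = NumericalVariable("Attendance 2012", NumericClass.ANY_UNIT, decimal_places=0)
color = CategoryVariable("Color", ("blue", "red"))
```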

Along with helping users tie variable definition and precision to experiment design, the requirement for setting a data type has the benefit of producing "clean" data tables. Numbers entered into a variable column are limited to the column's defined precision and class. Values of Category variables are selected from dropdown lists rather than typed. Formula and Generated values are consistent with the attributes of the variables used in their derivation; that is, precision is maintained through calculations by rounding each result to the precision of the least precise measure involved.
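
The rounding rule for derived values can be sketched in Python with the standard `decimal` module. This is a sketch under stated assumptions: the function names are hypothetical, and the half-up rounding mode is an assumption, since the text does not say how ties are resolved.

```python
from decimal import Decimal, ROUND_HALF_UP


def places(value: Decimal) -> int:
    """Number of decimal places carried by a measured value."""
    return max(0, -value.as_tuple().exponent)


def round_to_least_precise(result: Decimal, *operands: Decimal) -> Decimal:
    """Round a derived result to the precision of its least precise operand."""
    least = min(places(operand) for operand in operands)
    quantum = Decimal(1).scaleb(-least)  # e.g. 0.1 for one decimal place
    # ROUND_HALF_UP is an assumed tie-breaking rule, not one stated in the text.
    return result.quantize(quantum, rounding=ROUND_HALF_UP)


length = Decimal("12.5")   # measured to tenths
width = Decimal("3.25")    # measured to hundredths
area = round_to_least_precise(length * width, length, width)
print(area)  # the raw product 40.625 is reported to tenths
```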

The Data Workshop provides two ways to help users generate data:

  • Surveys: Design, create, and conduct electronic surveys.
  • Simulations: Design custom simulations of typical, and some non-typical, probabilistic experiments.

Surveys can be designed and created using a set of question types that map to the variables and their associated attributes in the Workbench. Questions with numeric responses can be designed to collect clean responses with specific measures (e.g., a question about age can be set to accept only whole-number years in a given range). Multiple-choice questions offer well-defined categories with either single- or multiple-response options (often called radio button and checkbox responses, respectively). Once a survey has been designed, tested, and activated, users can give prospective participants a URL for responding to it, and the results are dynamically updated in a table for analysis in the Workbench.
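
The age example above amounts to validating each response against the question's definition before it reaches the data table. A minimal sketch, with a hypothetical `WholeNumberQuestion` class and an assumed range of 5 to 120 years, might look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WholeNumberQuestion:
    """Hypothetical survey question that accepts whole numbers in a range."""
    prompt: str
    minimum: int
    maximum: int

    def parse(self, response: str) -> int:
        text = response.strip()
        # Reject decimals, signs, words, and empty responses.
        if not (text.isascii() and text.isdigit()):
            raise ValueError("response must be a whole number")
        value = int(text)
        if not self.minimum <= value <= self.maximum:
            raise ValueError(
                f"response must be between {self.minimum} and {self.maximum}"
            )
        return value


age = WholeNumberQuestion("How old are you, in years?", minimum=5, maximum=120)
print(age.parse(" 34 "))
```

Rejecting a response such as "34.5" at entry time is what keeps the resulting column consistent with a Numerical variable defined to zero decimal places.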

Simulations currently include Spinners (one or two spinners each with adjustable numbers of weighted segments), Coin Flip (up to 10 coins per toss), and Dice Toss (one or two dice each with 4, 6, 8, or 12 faces and weighted face values). All simulations allow the user to conduct single events, multiple events (e.g., toss a die 20 times), and repeated events with the option of accumulating the data. Simulation results can then be represented with any of the tables or graphs available in the Workbench.
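
A Dice Toss of this kind can be approximated in a few lines of Python. The sketch below is not the Workshop's implementation: the function name and parameters are hypothetical, and it reads "weighted" as a relative likelihood assigned to each face.

```python
import random
from collections import Counter

ALLOWED_FACE_COUNTS = (4, 6, 8, 12)  # die sizes named in the text


def toss_die(faces=6, tosses=1, weights=None, rng=None):
    """Toss one die `tosses` times; `weights` gives each face's relative likelihood."""
    if faces not in ALLOWED_FACE_COUNTS:
        raise ValueError(f"faces must be one of {ALLOWED_FACE_COUNTS}")
    if weights is not None and len(weights) != faces:
        raise ValueError("weights must list one value per face")
    rng = rng or random.Random()
    return rng.choices(range(1, faces + 1), weights=weights, k=tosses)


# Repeated events with accumulation: three runs of 20 tosses each.
rng = random.Random(2016)  # seeded only so the run can be repeated
accumulated = Counter()
for _ in range(3):
    accumulated.update(toss_die(faces=6, tosses=20, rng=rng))

print(sorted(accumulated.items()))
```

Passing `weights=[1, 1, 1, 1, 1, 5]` would, under this reading, make a six five times as likely as any other face.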

Data analysis

The current version of the Data Workshop has a collection of single- and multi-variable statistics and graphs appropriate for an introductory probability and statistics course for preservice elementary school teachers. These statistics and graphs are associated with specific data types:

  • Numerical:
    • Frequency tables
    • Landmarks—count, mode, mean, minimum, median, first and third quartiles, maximum, range, interquartile range, and standard deviation
    • Histograms
    • Boxplots
    • Scatter plots
  • Category:
    • Frequency tables
    • Contingency tables
    • Bar graphs (simple and segmented)
    • Circle graphs
  • Formula and Generated: Same as for numerical data
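
For readers who want to check the Landmarks for a numerical variable by hand, they can be computed with Python's standard `statistics` module. Two choices in this sketch are assumptions, because the text does not specify them: quartiles use the "inclusive" interpolation method and the standard deviation is the sample version. The function needs at least two values.

```python
import statistics


def landmarks(values):
    """Summary statistics for one numerical variable (at least two values)."""
    data = sorted(values)
    # Quartile conventions differ between tools; "inclusive" is an assumption.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mode": statistics.multimode(data),  # every most-frequent value
        "mean": statistics.fmean(data),
        "minimum": data[0],
        "median": statistics.median(data),
        "first quartile": q1,
        "third quartile": q3,
        "maximum": data[-1],
        "range": data[-1] - data[0],
        "interquartile range": q3 - q1,
        # Sample standard deviation; the population version is statistics.pstdev.
        "standard deviation": statistics.stdev(data),
    }


summary = landmarks([2, 4, 4, 5, 7, 9, 10])
for name, value in summary.items():
    print(f"{name}: {value}")
```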

An important consequence of careful definition of data types prior to data entry is that the software can guide the user toward appropriate tables and graphs.³ This happens both by disabling inappropriate menu options in the main Statistics and Graph menus and by suggesting appropriate alternative graphs within table and graph windows. For example, Figure 1 shows part of a table with the numeric variable "Attendance 2012" selected. The main Statistics menu shows only the two appropriate choices, Frequency Table and Landmarks, and the inappropriate Contingency Table option is unavailable.

[Image: screenshot of Data Workbench]
Figure 1: Statistics menu options for a Numerical variable. Data taken from TuvaLabs; originally from Global Attractions Attendance Report.

Figure 2 shows additional Graphs and Tables available for the box plot of the numeric variable "Attendance 2012".

[Image: screenshot of Data Workbench]
Figure 2: Suggested additional Graphs and Tables for the Numeric variable "Attendance 2012".
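
The guidance shown in Figures 1 and 2 can be thought of as a lookup from a variable's declared type to the options that make sense for it. The sketch below encodes the associations listed under "Data analysis"; the dictionary layout and the function name are hypothetical, and treating Text variables as having no statistics or graphs is an assumption.

```python
# Formula and Generated variables are analyzed the same way as Numerical ones.
NUMERIC_LIKE = {"Numerical", "Formula", "Generated"}

STATISTICS_MENU = {
    "Frequency Table": NUMERIC_LIKE | {"Category"},
    "Landmarks": NUMERIC_LIKE,
    "Contingency Table": {"Category"},
}

GRAPH_MENU = {
    "Histogram": NUMERIC_LIKE,
    "Boxplot": NUMERIC_LIKE,
    "Scatter Plot": NUMERIC_LIKE,
    "Bar Graph": {"Category"},
    "Circle Graph": {"Category"},
}


def enabled_options(menu, variable_type):
    """List the menu options that apply to the selected variable's type."""
    return [option for option, types in menu.items() if variable_type in types]


print(enabled_options(STATISTICS_MENU, "Numerical"))
print(enabled_options(GRAPH_MENU, "Category"))
```

Because the lookup depends only on the declared type, an option such as Contingency Table can be disabled before any calculation is attempted.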

Other data analysis features of the Data Workshop include:

  • Protection of data validity and reliability through managed editing of experimental data. For example, editing the precision of numeric data after entry is discouraged because measurement accuracy depends on the measuring device, not the data analyst's preferences. Likewise, the number of categories in a survey question cannot be edited once a survey is activated because earlier participants would not have had the revised options when answering the question.
  • Automatic, dynamic updating of all representations when data tables are edited. Representations always reflect the data in its current form, which is not the case in many data analysis applications.
  • Cloud-based file sharing to allow people to accumulate data of similar experiments collected at different times or to represent and analyze other people's data.
  • Sharing of survey templates to allow accumulation of data across multiple sources.
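
The rule that a survey question's categories are frozen at activation can be sketched as a small state check. The class below is hypothetical and only illustrates the idea; it is not the Survey Room's code.

```python
class SurveyQuestion:
    """Hypothetical multiple-choice question whose options lock on activation."""

    def __init__(self, prompt, categories):
        self.prompt = prompt
        self._categories = list(categories)
        self._activated = False

    @property
    def categories(self):
        return tuple(self._categories)

    def activate(self):
        """Open the question to participants; options can no longer change."""
        self._activated = True

    def add_category(self, label):
        if self._activated:
            # Earlier participants never saw this option, so adding it now
            # would make their responses incomparable with later ones.
            raise RuntimeError("categories cannot be edited after activation")
        self._categories.append(label)


question = SurveyQuestion("Favorite color?", ["blue", "red"])
question.add_category("green")  # allowed while the survey is being designed
question.activate()
```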

For more information about the Data Workshop Project, please contact us at:

dataworkshop@lists.uchicago.edu
UChicago STEM Education
University of Chicago

Bibliography

Common Core State Standards Initiative. Common Core State Standards for Mathematics. http://www.corestandards.org/assets/CCSSI_Math%20Standards.pdf (accessed February 5, 2016).

Data Description, Inc. Data Desk. http://www.datadesk.com (accessed February 5, 2016).

Fathom. Fathom Dynamic Data Software. http://concord.org/fathom-dynamic-data-software (accessed February 5, 2016).

It's About Time. Interactive Mathematics Program. https://www.iat.com/courses/mathematics/interactive-mathematics-program/?type=introduction (accessed February 5, 2016).

McBride, Jim. Data/Probability Tool for Use with Elementary Mathematics Curricula. Data Tool Specs2.doc. 22 August 2008.

McBride, Jim. Revised Thoughts on the Data Tool. Revised Thoughts June29 2009 – edited.doc. 29 June 2009.

McBride, Jim. Revised Thoughts on the Data Tool. More Thoughts Aug31 2009.doc. 31 August 2009.

McBride, Jim. Specs for Data Work Procedures (Jan 30, 2010) (annotated). Specs for Data Work procedures DEC 05 11 10 JF.doc. 18 May 2010.

The R Foundation. The R Project for Statistical Computing. https://www.r-project.org (accessed February 8, 2016).

SAS Institute, Inc. SAS Statistical Analysis. http://www.sas.com/en_us/software/analytics/high-performance-statistics.html (accessed February 8, 2016).

IBM. SPSS Statistics. http://www-01.ibm.com/software/analytics/spss/products/statistics/ (accessed February 8, 2016).

TI. TI Graphing Calculators appropriate for Statistics/AP Statistics courses. http://education.ti.com/en/us/product-resources/graphing_course_comparision (accessed February 8, 2016).

Learn Troop. TinkerPlots Software. http://www.tinkerplots.com (accessed February 8, 2016).

University of Chicago. UCSMP Grades 6 – 12 Curriculum Features. http://ucsmp.uchicago.edu/secondary/curriculum-features/ (accessed February 8, 2016).

¹ The Data Workshop Project began in Fall 2012 as a continuation of the Data Tools part (2008 – 2010) of the Web-Based Curriculum project of the Chicago Science Group. Its products are based on design and specification documents from Jim McBride (McBride 2008, 2009a, 2009b, 2010).

² Although its tools are robust, the Data Workshop is primarily an educational tool and is not intended to provide the kind of full-featured professional support available in statistical programming languages or data-mining applications such as R (2015), SAS (2015), or SPSS (2015).

³ An early field-test instructor noted that the built-in guidance of the Workbench made her evaluate how much time was spent discussing the nonsensical graphs that other elementary data analysis software generates because data is not defined by type.