Mopping and Scrubbing Dirty Data

Data-driven decision making (D3M) appears to be the new "buzz phrase" for this century – the age of information. Teachers and administrators are engaging in data-centered dialogue in grade-level meetings, lounges, hallways, and classrooms as they "brainstorm" toward closing the gap in student achievement. Such discussion among professional educators has dramatically increased since the enactment of the No Child Left Behind Act of 2001. As a result of this Act, the teaching community is establishing data repositories to analyze information to improve teaching and learning within the school environment. Business organizations specializing in data collection, analysis, and technology reporting are rapidly emerging in large numbers to assist schools as they struggle to meet state, federal, and local requirements. Similarly, many software vendors now serve the education market through their development of disaggregation products to help schools meet their goals and close the gap in student achievement via D3M.

Data Repository

Data collection within schools toward the improvement of teaching and learning is not a new practice. Schools have been collecting, analyzing and reporting data manually to local, state, and federal agencies for many years. Over time, technically skilled employees began to develop electronic spreadsheets and databases to meet the increasing demand for data. However, as this demand continues to grow, increasingly greater sophistication in data analysis, necessitating highly qualified database personnel and systems, is also expected on an ongoing basis. No longer is large-scale data analysis conducted with "flat" database software such as Excel, File Maker Pro, or similar computer spreadsheet programs. Robust "relational" database systems, such as Microsoft SQL or Oracle engines, are now used to manipulate such data.

After immeasurable hours have been invested in the manual "crunching" of numbers, new technologies have minimized the burden of educational reporting. Technology has assumed the role of managing data, freeing teachers and administrators to focus on making decisions drawn from the data. The influx of Relational-Database Management Systems (RDMSs), Data Warehousing (DW) strategies, and procedures based upon the School Interoperability Framework (SIF) repository model is generating a knowledge base, or pool of data-analysis experts, that is significantly influencing the decision making of school administrators as they review the multiple measures now available to them. However, before any decision can be made with regard to developing or purchasing a RDMS, incorporating a DW strategy, or even considering basing an important decision on a SIF repository model, schools and school districts must have procedures in place for ensuring quality data.

Mopping and Scrubbing

To ensure data quality, two processes must be in place: the mop and the scrub. Mapping or mopping data is the process of locating or identifying where and how data are stored throughout the respective organization. This procedure provides database personnel with the information required to develop a data plan surrounding organizational needs, wants, costs, and desired outcomes.

Data collection and input are important steps in the data-analysis process toward successful D3M. Data collection begins with data mapping, and data mapping equates to surveying needs as well as the areas where data are located. Data can be found in unexpected places, such as boxes in a storage area; desk drawers or cupboards within a classroom; a file cabinet belonging to a principal; the files of a secretary; or stored on floppy disks, computer hard drives, CD-ROMs, recording tapes, notebooks or sticky pads, grade books, and notes places within teacher handbooks. Because of the various potential locations and conditions of data, once resurrected from their hidden places, a concoction of data-quality issues must often be resolved before analysis can even hope to begin. The best way to resolve these anomalies is to scrub the data.

Data scrubbing is often overlooked within the process of data collection. The common adage-garbage in, garbage out-is appropriately analogous to data collection and input. Data scrubbing pertains to removing erroneous pieces of information within a data set. These bits and pieces of information are debris that is "dirty" or "contaminated" and it can drastically affect the outcome of data analysis. The source of data impurities can range from inaccurate input; incomplete information; improperly formatted structures; and the most common, duplication of information. Consideration of impure or dirty data will result in a flawed analysis, potentially leading to an inaccurate prognosis and/or diagnosis with implementation of fatal interventions.

Data scrubbing requires no special art or science. A plethora of vendors selling data-cleansing software each project their individual products as the best on the market. Many of these products would not serve the educational environment in an effective manner. For example, the most popular scrubber within today's marketplace is a zip-code scrubber. Zip-code errors have the least impact of all demographic data affecting outcomes within a school environment. In most cases, students attend schools within their zip-code region, and there is no need for cleansing zip codes with expensive software. However, other demographic information entered into educational databases with no consistent rules or guidelines can indeed require future scrubbing. Data scrubbing or cleansing is crucial. The process results in high-quality data that are appropriate for effective data analysis. It removes fallacious marks or debris-datum by datum-either manually or through a series of scripts.

Datum by Datum

In a common database repository, a routine method for data input is rare unless specified guidelines and structures are in place that allow consistency. Information can be filed in a myriad of ways (e.g., name, initials, seating charts, or ID numbers). Sometimes data are stored on electronic media with first and last names or numeric or alphabetical identification. Numeric identification can consist of multiple digits that are left or right justified. For example, a student with an identification number of 45632 is most likely the same student as 00045632, but unlikely to be the same student as 45632000.

Another example is with the use of student addresses as identification. Some computer-database users might enter only the letter "W" to indicate "West" in an address, while others might type the full word in the street name. As simple as this may seem, incorrect entry contaminates the database. Entering the street information as "W. Parker Boulevard" is quite different from "West Parker Boulevard." The former refers to direction (i.e., the street name is "Parker," and the street runs east to west). The latter format, however, reflects a street with a direction running north-south or east-west. Even the "Boulevard" designation presents various formats (e.g., Bl., Blv., Bd., or Blvd.). Consistent and uniform formats upon data entry is critical to data quality. When simple guideline procedures are neglected, and shortcuts, abbreviations, careless entries, or use of common nomenclatures occur, the resulting data are impure and jeopardize proper analysis. With patterns of anomalies in information formatting, data corruption is likely. "Clean" data uphold the integrity of the analysis and effective D3M.

Dirty data are often associated with data-entry shortcuts, misunderstanding, and carelessness due to the personal preferences of the individual keying or otherwise inputting the data. Data entry remains a manual process. It requires a "floor plan" or procedural layout. Bad data can be compared to a stealth virus. It runs in the background, systematically "gnawing away" at the database residing on the hard drive before becoming public. It eventually strikes in an invasive manner and all data within the computer system are corrupted. Dirty data are much like a calcium buildup in a water pipe; sooner or later, the consequence will become evident. Data cleansing must become the responsibility of each individual playing a role in data creation and/or processing at any level. It begins with the instrument used to collect the data. Therefore, schools and districts should have a specific and consistent procedure, such as the following, in place for requesting data:

Use a standard form to request data.
Be specific about the information sought.
State the purpose for the request.
Explain how the data are needed.
Inform as to how often the data will be requested.