Mopping and Scrubbing Dirty Data - Tech Learning

Mopping and Scrubbing Dirty Data

Data-driven decision making (D3M) appears to be the new "buzz phrase" for this century – the age of information. Teachers and administrators are engaging in data-centered dialogue in grade-level meetings, lounges, hallways, and classrooms as they "brainstorm" toward closing the gap in student achievement. Such
Publish date:

Data-driven decision making (D3M) appears to be the new "buzz phrase" for this century – the age of information. Teachers and administrators are engaging in data-centered dialogue in grade-level meetings, lounges, hallways, and classrooms as they "brainstorm" toward closing the gap in student achievement. Such discussion among professional educators has dramatically increased since the enactment of the No Child Left Behind Act of 2001. As a result of this Act, the teaching community is establishing data repositories to analyze information to improve teaching and learning within the school environment. Business organizations specializing in data collection, analysis, and technology reporting are rapidly emerging in large numbers to assist schools as they struggle to meet state, federal, and local requirements. Similarly, many software vendors now serve the education market through their development of disaggregation products to help schools meet their goals and close the gap in student achievement via D3M.

Data Repository

Data collection within schools toward the improvement of teaching and learning is not a new practice. Schools have been collecting, analyzing and reporting data manually to local, state, and federal agencies for many years. Over time, technically skilled employees began to develop electronic spreadsheets and databases to meet the increasing demand for data. However, as this demand continues to grow, increasingly greater sophistication in data analysis, necessitating highly qualified database personnel and systems, is also expected on an ongoing basis. No longer is large-scale data analysis conducted with "flat" database software such as Excel, File Maker Pro, or similar computer spreadsheet programs. Robust "relational" database systems, such as Microsoft SQL or Oracle engines, are now used to manipulate such data.

After immeasurable hours have been invested in the manual "crunching" of numbers, new technologies have minimized the burden of educational reporting. Technology has assumed the role of managing data, freeing teachers and administrators to focus on making decisions drawn from the data. The influx of Relational-Database Management Systems (RDMSs), Data Warehousing (DW) strategies, and procedures based upon the School Interoperability Framework (SIF) repository model is generating a knowledge base, or pool of data-analysis experts, that is significantly influencing the decision making of school administrators as they review the multiple measures now available to them. However, before any decision can be made with regard to developing or purchasing a RDMS, incorporating a DW strategy, or even considering basing an important decision on a SIF repository model, schools and school districts must have procedures in place for ensuring quality data.

Mopping and Scrubbing

To ensure data quality, two processes must be in place: the mop and the scrub. Mapping or mopping data is the process of locating or identifying where and how data are stored throughout the respective organization. This procedure provides database personnel with the information required to develop a data plan surrounding organizational needs, wants, costs, and desired outcomes.

Data collection and input are important steps in the data-analysis process toward successful D3M. Data collection begins with data mapping, and data mapping equates to surveying needs as well as the areas where data are located. Data can be found in unexpected places, such as boxes in a storage area; desk drawers or cupboards within a classroom; a file cabinet belonging to a principal; the files of a secretary; or stored on floppy disks, computer hard drives, CD-ROMs, recording tapes, notebooks or sticky pads, grade books, and notes places within teacher handbooks. Because of the various potential locations and conditions of data, once resurrected from their hidden places, a concoction of data-quality issues must often be resolved before analysis can even hope to begin. The best way to resolve these anomalies is to scrub the data.

Data scrubbing is often overlooked within the process of data collection. The common adage-garbage in, garbage out-is appropriately analogous to data collection and input. Data scrubbing pertains to removing erroneous pieces of information within a data set. These bits and pieces of information are debris that is "dirty" or "contaminated" and it can drastically affect the outcome of data analysis. The source of data impurities can range from inaccurate input; incomplete information; improperly formatted structures; and the most common, duplication of information. Consideration of impure or dirty data will result in a flawed analysis, potentially leading to an inaccurate prognosis and/or diagnosis with implementation of fatal interventions.

Data scrubbing requires no special art or science. A plethora of vendors selling data-cleansing software each project their individual products as the best on the market. Many of these products would not serve the educational environment in an effective manner. For example, the most popular scrubber within today's marketplace is a zip-code scrubber. Zip-code errors have the least impact of all demographic data affecting outcomes within a school environment. In most cases, students attend schools within their zip-code region, and there is no need for cleansing zip codes with expensive software. However, other demographic information entered into educational databases with no consistent rules or guidelines can indeed require future scrubbing. Data scrubbing or cleansing is crucial. The process results in high-quality data that are appropriate for effective data analysis. It removes fallacious marks or debris-datum by datum-either manually or through a series of scripts.

Datum by Datum

In a common database repository, a routine method for data input is rare unless specified guidelines and structures are in place that allow consistency. Information can be filed in a myriad of ways (e.g., name, initials, seating charts, or ID numbers). Sometimes data are stored on electronic media with first and last names or numeric or alphabetical identification. Numeric identification can consist of multiple digits that are left or right justified. For example, a student with an identification number of 45632 is most likely the same student as 00045632, but unlikely to be the same student as 45632000.

Another example is with the use of student addresses as identification. Some computer-database users might enter only the letter "W" to indicate "West" in an address, while others might type the full word in the street name. As simple as this may seem, incorrect entry contaminates the database. Entering the street information as "W. Parker Boulevard" is quite different from "West Parker Boulevard." The former refers to direction (i.e., the street name is "Parker," and the street runs east to west). The latter format, however, reflects a street with a direction running north-south or east-west. Even the "Boulevard" designation presents various formats (e.g., Bl., Blv., Bd., or Blvd.). Consistent and uniform formats upon data entry is critical to data quality. When simple guideline procedures are neglected, and shortcuts, abbreviations, careless entries, or use of common nomenclatures occur, the resulting data are impure and jeopardize proper analysis. With patterns of anomalies in information formatting, data corruption is likely. "Clean" data uphold the integrity of the analysis and effective D3M.

Dirty data are often associated with data-entry shortcuts, misunderstanding, and carelessness due to the personal preferences of the individual keying or otherwise inputting the data. Data entry remains a manual process. It requires a "floor plan" or procedural layout. Bad data can be compared to a stealth virus. It runs in the background, systematically "gnawing away" at the database residing on the hard drive before becoming public. It eventually strikes in an invasive manner and all data within the computer system are corrupted. Dirty data are much like a calcium buildup in a water pipe; sooner or later, the consequence will become evident. Data cleansing must become the responsibility of each individual playing a role in data creation and/or processing at any level. It begins with the instrument used to collect the data. Therefore, schools and districts should have a specific and consistent procedure, such as the following, in place for requesting data:

  • Use a standard form to request data.
  • Be specific about the information sought.
  • State the purpose for the request.
  • Explain how the data are needed.
  • Inform as to how often the data will be requested.

Avoiding the Pitfalls

As school personnel advance their skills in using RDMS, DW, and SIF strategies, data quality becomes increasingly crucial. In fact, RDMS, DW, and SIF strategies require data integrity and a schema of the data fields and layouts. Normally, a unique identification number with a required amount of digits serves as the primary key. Unique numbers throughout a school or school district will promote optimal data quality, data integration, data migration, data profiling, data analysis, and data management. These attributes are of greater importance in the design of a database than in data cleansing. Whether a particular field is labeled "M" for middle name or marital status, or identified as "Gender" or "Sex," is not as critical as how the data response is identified inside the field. For example, is the gender or sex of the respondent "male," "ml," "m," or "1" or "2"? Inconsistencies will present problems and ultimately require scrubbing.

Data impurities within school databases often derive from a single database being used for multiple purposes. When a variety of users are called upon to input important information into a single database, there must be clear guidelines surrounding the data entry. Another way a database can be saturated with impurities is by using hieroglyphics to denote students with special needs or students within special programs. For example, a database used to identify services to English-language learners may stipulate data entry to include "E" or "e" or "El" behind student names or following their student identification numbers or ethnicity. A GATE-identified student may have a "+" or "++" or "G" symbolized behind the response within one of these demographic fields to represent participation in the program.

Data scrubbing may require database personnel or a programmer to write scripts or develop a manual describing the tedious process of removing the described anomalies. Such hieroglyphics serve no effectual purpose in a database and should be avoided. Similarly, carelessness in data entry, either through keying or scanning, requires the adoption of a proven quality-control process. Entering a student address as 5657 instead of 5567, or a telephone number as 909. 555.5657 rather than 909.555.5567, are forms of "littering" the database. Other concerns involve the accuracy of raw scores, percent correct, normal curve equivalents, percentiles, grade equivalents, and standard scores. Statistics are also used in data analysis, and confusion in frequency counts, percentages, modes, means, and standard deviations can often be tracked to dirty data.


Data scrubbing or cleansing is the most critical step in the data-collection process toward effective analysis that ultimately results in D3M. Attention must be paid by all participants in the collection process to the procedures in place at each school site for the standardization of data entry. Scrubbing data is the responsibility of all individuals involved in the process; however, ultimate accountability must always reach the individual who entered the data. A process of validation must begin here to ensure entry of clean data. A system that returns dirty data to their source for proper reentry is recommended. When all involved recognize that data quality is an individual responsibility, and the magnitude of the potential consequences from analysis and decisions based upon dirty data are realized, effective and consistent D3M will result.

Email: Neil Mercurius



Data: Mining with a Mission

For some districts, the current obsession with data grows out of the need to comply with No Child Left Behind and additional accountability-related mandates. For others, it dates way back before the phrase "data-driven decision making" rolled so frequently off the tongues of educators. In either case, there is no

Driven by Data

Question: What is data-driven professional development? A mouthful What you get when you apply data-driven decision making to staff training A technique for planning professional development based on research instead of seat-of-the-pants guesswork All of the above The answer, of course, is d, and although

Data-driven teachers by Dr. Scott McLeod

Data driven decision making (DDDM) is a system of teaching and management practices that gets better information about students into the hands of classroom teachers.  The five major elements are •    good baseline data •    measurable instructional goals •    frequent

How to Perform a Data Makeover

If the data you're storing isn't the right data, or is inconsistent, erroneous, or incomplete, the best analytic tools in the world won't help you make sense of it. We give you four steps to help you "keep it clean."

Data Can Drive Development

Teaching practice can improve if teachers can look at themselves and student data in an objective manner. In most teacher education programs, teachers were not taught to use data to design curriculum and analyze their teaching practices. Teachers need training in both data management and data analysis as well as in

Data Glossary

Don't know your data mart from your dashboard? We explain the essentials.