Microsoft Excel Training Courses

Data Cleaning in Excel before Analysing Data

Data Cleaning Introduction

Many of my customers find themselves needing to use Excel to analyse their data. I have to admit that the Excel PivotTable® is probably the most effective method for most people to get to grips with what’s going on in their data. Unfortunately, many find that theyare downstream of the exporting process and receive Excel files which contain poor data. Ideally, one would expect those responsible for the databases to impose methods of improving data integrity by using data validation techniques. However, I suspect that the database designs are not up for redesign and the ‘Excel Analysts’ just have to get on with whatever is available to them.

In response to regular requests, I am documenting just a few of my more useful ideas which might help in the repetitive reporting cycle.

Preparation

  • Is the data in a ‘Good Format’?
    • Surrounded by blank rows and columns.
    • No completely blank rows.
    • Only one row at the top which is fully populated with unique column descriptions. The ‘Header Row’.
    • Test by selecting a cell in the data and selecting the current region by any of the following methods:
      • Ctrl + * (From the numeric keypad).
      • Ctrl + (Shift +8) from the top of the keyboard (*8 Key).
      • F5 > Special…> Current Region.
    • Scroll to check that all the data has been selected and that no adjacent data has been selected. (Trick: Shift + TAB will go from the top left of a selection to the bottom right of the selection. TAB to return without dropping the selection.)

PROBLEMS

Problem 1: Data is in a ‘Comma Separated Values’ file which could have a file extension of .csv.

  • I would strongly advise against just opening text files as in all probability it will import the data without giving you any control over the importing process. In particular dates and codes will become corrupted.
  • Instead, open a new workbook and use Data >Get External Data >From Text to launch the “Text Import Wizard”.       This method can be used on any file irrespective of the file extension as long as the file contains text.

Problem 2: Dates have not been imported correctly.

  • The root cause of this problem is that the dates have been imported from asource where they are in a different format to that on the computer running Excel. Excel uses the Control Panel >Region and Language Settings to define its native date format. E.g.UK or US. The worst case result is that an English looking date coming from a US formatted source will be accepted but corrupted. E.g. 6 July 2009 coming through as 7/6/2009 in US format might well end up as the 7 June 2009 whilst 23 July 2009 coming through as 7/23/2009 will (Excel 2007 onwards) be treated as text and display as 7/23/2009.This inconsistency has now ruined the data and it is pointless trying to salvage the situation from within Excel. The data needs to be re-imported for reliability.
  • To reliably import dates from external text files use Data >Get External Data >From Text … to launch the “Text Import Wizard”. In Step 3 of 3 of the “Text Import Wizard”, select the column/s containing the date and use the ‘Column data format’ group to set the date format to suit the format of the external data source. E.g. MDY

Excel import dates

  • To gain confidence that dates have arrived in the workbook correctly, try removing the number formatting (Home Tab >Editing> Clear > Clear Formats) and expect to see the dates as date serial numbers aligned on the right. Date serial numbers will be in the order of 42,000, being a count of the number of days from 1 Jan 1900. (27 Dec 2014 is 42,000)
  • See below for a useful tip on ‘Grouping by date’ in a PivotTable®.

Problem 3: Codes have not been imported correctly.

This is often a problem with codes which look like a numeric with leading zeroes. E.g. UK Telephone area codes, serial numbers, project codes all with leading zeros. Also credit card numbers in excess of 15 digits. It is probably the inconsistency of the result which catches the analyst’s attention here as some values import OK whilst others have lost leading zeros. If it looks like a number, Excel will treat it as a number and trim off the leading spaces. Codes with spaces or text do not look like a number to Excel and will be treated as text. The clue here is to remove any alignment, widen the column and look for whether the code is left aligned (text) or right aligned (value). Ideally these should have all been imported as text for consistency.

To reliably import codes from external files use Data > Get External Data > From Text … to launch the “Text Import Wizard”. In Step 3 of 3 of the “Text Import Wizard”, select the column/s containing the code and use the ‘Column data format’ group to set the format to text.

Excel import codes

Problem 4: Data has ended up in one column and needs splitting out into separate columns.

  • Consider reviewing the method of how the data ended up like this and try to avoid it in the first place. If there is alternative then use the ‘Text to Columns Wizard’ (From Data > Data Tools > Text to Columns.)

Problem 5: Text data not as required.

  • Remove double spaces, leading spaces and trailing spaces with Trim().
  • Remove non printable characters with Clean().
  • Split data out of cells with either :
    • ‘Text to Columns Wizard’ (From Data > Data Tools > Text to Columns.)
    • Text functions like Left(), Mid(), Right()

Problem 6: Data has been entered inconsistently.

  • Consider creating a “Correction Table” in which the first column lists all the possible variations which are encountered in the data whilst the second column contains the preferred value.       Then insert an extra column in the original list and lookup the problem codes in this ‘correction table’ using VLOOKUP(lookup_value,table_array,col_index_num,range_lookup) where range_lookup is FALSE for an exact match.
    • Methods for creating a list of unique entries for the correction table from the actual data could include:
      • Advanced Filtering (Data > Sort & Filter > Advanced).
        • Filter the list, In-place.
        • ListRange: to include only the cells in the column in which in which you are looking for unique values. Include the heading.
        • CriteriaRange: to include only the cells in the column in which you are looking for unique values. Include the heading.
        • Check ‘Unique records only’.
        • This crude use of Advanced Filter will hide the necessary rows.
      • Alternatively, create a PivotTable of just the column containing the actual data from which you require unique entries.
    • Copy the data to your “Correction Table”.
  • Keep and update this table for reuse on the next update cycle.

Problem 7: Empty / Null Data Values exist in the Data.

A neat clue that this problem exists is that when a column perceived to contain numeric values is placed in the ‘Values’ area of the PivotTable® it would normally default to Sum. In the event that it defaults to ‘Count’ then there is an empty cell/s or Null data value/sand/or a non numeric data valuein the source. This will need investigating. Any “values” stored as text will not be included in the PivotTable totals as they will be treated as zero! The IsNumber() function with a Data > Sort & Filter > Filter applied would be useful here to locate the non-numeric entries.

Problem 8: You have more than 1Million (1,048,576) rows of data.

Discover my favourite tool in Excel, MS Query. Leave the data in the SQL Server, Access Database, Excel file or the .CSV file or wherever and build the PivotTable® directly on this external data using the MS Query Tool. Investigate creating connections for convenient reconnection to source data. Investigate beyond the MS Query Wizard and discover the GUI. Look further and discover the power of SQL. Discover Parameters and how changes in Excel can automatically cause the Query to rerun using value(s) in the Excel sheet to control the actual rows returned. (But not a PivotTable® built on External data).

 

TIPS

Tip 1: A PivotTable® fails to detect that you have appended data to the source data in Excel.

Use anExcel Table for your PivotTable® source data to ensure that appended data is automatically recognised by the PivotTable® and avoid the problem.

Tip 2: Referencing an Excel List is inconvenient.

Use a Defined Name (Range Name) for your PivotTable® source data so that you can conveniently refer to that data by its name. Use F3 for Paste Names so as to be able to simply pick the name from the list.

Note: Combine this with Tip 1 above. Also note that Defined Names will automatically extend / contract with the Table. The name assigned to a Table (Table1etc) is a Table Name and is not listed in ‘Paste Names’ (F3).

Tip 3: Grouping results by Date in a PivotTable®.

For as long as ALL the data values in a column contain date serial numbers, (See above for checks), then if the date column is included in the Row or Column area of the PivotTable® layout then a right click on one of the dates in the PivotTable® offers ‘Group …’ and displays a date grouping dialog box offering the ability to Group by Years, Quarters, Months, Days, Hours, Minutes and Seconds or combinations.

Group results by date in Excel PivotTable

Tip 4: Grouping results into Corporate Financial Periods in a PivotTable® where the source data is in an Excel list.

Create a suitable Table which defines your own corporate financial periods and use Vlookup() to interpret actual dates in the data to your corporate periods. Place the results alongsidethe original list. Use these columns in your PivotTable to group by your corporate financial periods as required.

Eg: A Suitable List Named: (Say)Corporate_Fin_Periods:

Excel financial periods

The Vlookup for the financial year:

=VLOOKUP(lookup_value,Corporate_Fin_Periods,2,TRUE)

Data cleaning is a necessary activity that preceeds data analysis and this article has aimed to show some of the “gotchas” to watch out for and to provide some helpful tips to ease the process.

Stuart Tayler
26 January 2016