Our expert trainers can help you get the best results.
If you would like any further information please contact one of our training advisors.
Data Cleaning Introduction
Many of my customers find themselves needing to use Excel to analyse their data. I have to admit that the Excel PivotTable® is probably the most effective method for most people to get to grips with what’s going on in their data. Unfortunately, many find that they are downstream of the exporting process and receive Excel files which contain poor data. Ideally, one would expect those responsible for the databases to impose methods of improving data integrity by using data validation techniques. However, I suspect that the database designs are not up for redesign and the ‘Excel Analysts’ just have to get on with whatever is available to them.
In response to regular requests, I am documenting just a few of my more useful ideas which might help in the repetitive reporting cycle.
Preparation
PROBLEMS
Problem 1: Data is in a ‘Comma Separated Values’ file which could have a file extension of .csv.
Problem 2: Dates have not been imported correctly.
Problem 3: Codes have not been imported correctly.
This is often a problem with codes which look like a numeric with leading zeroes. E.g. UK Telephone area codes, serial numbers, project codes all with leading zeros. Also credit card numbers in excess of 15 digits. It is probably the inconsistency of the result which catches the analyst’s attention here as some values import OK whilst others have lost leading zeros. If it looks like a number, Excel will treat it as a number and trim off the leading spaces. Codes with spaces or text do not look like a number to Excel and will be treated as text. The clue here is to remove any alignment, widen the column and look for whether the code is left aligned (text) or right aligned (value). Ideally these should have all been imported as text for consistency.
To reliably import codes from external files use Data > Get External Data > From Text … to launch the “Text Import Wizard”. In Step 3 of 3 of the “Text Import Wizard”, select the column/s containing the code and use the ‘Column data format’ group to set the format to text.
Problem 4: Data has ended up in one column and needs splitting out into separate columns.
Problem 5: Text data not as required.
Problem 6: Data has been entered inconsistently.
Problem 7: Empty / Null Data Values exist in the Data.
A neat clue that this problem exists is that when a column perceived to contain numeric values is placed in the ‘Values’ area of the PivotTable® it would normally default to Sum. In the event that it defaults to ‘Count’ then there is an empty cell/s or Null data value/sand/or a non numeric data valuein the source. This will need investigating. Any “values” stored as text will not be included in the PivotTable totals as they will be treated as zero! The IsNumber() function with a Data > Sort & Filter > Filter applied would be useful here to locate the non-numeric entries.
Problem 8: You have more than 1Million (1,048,576) rows of data.
Discover my favourite tool in Excel, MS Query. Leave the data in the SQL Server, Access Database, Excel file or the .CSV file or wherever and build the PivotTable® directly on this external data using the MS Query Tool. Investigate creating connections for convenient reconnection to source data. Investigate beyond the MS Query Wizard and discover the GUI. Look further and discover the power of SQL. Discover Parameters and how changes in Excel can automatically cause the Query to rerun using value(s) in the Excel sheet to control the actual rows returned. (But not a PivotTable® built on External data).
TIPS
Tip 1: A PivotTable® fails to detect that you have appended data to the source data in Excel.
Use anExcel Table for your PivotTable® source data to ensure that appended data is automatically recognised by the PivotTable® and avoid the problem.
Tip 2: Referencing an Excel List is inconvenient.
Use a Defined Name (Range Name) for your PivotTable® source data so that you can conveniently refer to that data by its name. Use F3 for Paste Names so as to be able to simply pick the name from the list.
Note: Combine this with Tip 1 above. Also note that Defined Names will automatically extend / contract with the Table. The name assigned to a Table (Table1etc) is a Table Name and is not listed in ‘Paste Names’ (F3).
Tip 3: Grouping results by Date in a PivotTable®.
For as long as ALL the data values in a column contain date serial numbers, (See above for checks), then if the date column is included in the Row or Column area of the PivotTable® layout then a right click on one of the dates in the PivotTable® offers ‘Group …’ and displays a date grouping dialog box offering the ability to Group by Years, Quarters, Months, Days, Hours, Minutes and Seconds or combinations.
Tip 4: Grouping results into Corporate Financial Periods in a PivotTable® where the source data is in an Excel list.
Create a suitable Table which defines your own corporate financial periods and use Vlookup() to interpret actual dates in the data to your corporate periods. Place the results alongsidethe original list. Use these columns in your PivotTable to group by your corporate financial periods as required.
Eg: A Suitable List Named: (Say)Corporate_Fin_Periods:
The Vlookup for the financial year: =VLOOKUP(lookup_value,Corporate_Fin_Periods,2,TRUE)
Data cleaning is a necessary activity that preceeds data analysis and this article has aimed to show some of the “gotchas” to watch out for and to provide some helpful tips to ease the process.