Vetting data

From Wiki
Revision as of 16:55, 5 September 2017 by EllieBremer (Talk | contribs)

Jump to: navigation, search

Vetting is the process of checking that the data that we import to IFs are of good quality. Even when we import the data through an automated application there could be errors or missing data in the imported series.

One of the major reason an entire country (or a number of countries) could be missed is the mismatch in country concordance table. Make sure the IFs country concordance list (i.e., the relevant column in the Country Translation table in IFs.mdb) is correct and up to date. Make sure none of the year columns are missed. Sometime the year column is missed because the data is formatted as text and not as numbers. One more general issue is around the units. Make sure you perform any unit conversion required before you import the data. Comparing the new data with the current IFs series will give you an idea on any unit conversion that might be needed.

It is not recommended that you vet every single country. You can ensure data quality by following certain procedures:

  1. For an updated table, open the existing and updated table side by side. Most socio-economic data do not change very fast and we usually update data on a regular basis. If you see big difference on the most recent field you should be concerned. Sort the "MostRecent" column ascending and descending and compare the countries that are highest and lowest in the two tables.
  2. Both for updated and new data, check the countries that you are familiar with or for a country for which we can make a guess. For example, the 'ranking' comparison described above can be done as a 'ranking'  check for new data (rank the "Most Recent" or "Earliest") for data series on issue areas (like poverty or governance or education) for which we have a good sense of global ranking.
  3. For any series, check large and important countries like USA, China or Germany. Check a couple of very poor and very small countries. Check any other country that you are very familiar with.
  4. Check countries with similar name (like the two Congos and two Koreas)
  5. If you are using template spreadsheet and manual copy and paste, check for countries at the bottom and top of the spreadsheet (don't miss Zimbabwe).
  6. Check for 0's. We take no data as an empty cell. If we have 0's that need to be data and must be a feasible value (e.g., GDP cannot be 0).
  7. Do a reality check on variable definition: Make sure percentages are below 100 (there are cases when percentages can be above 100, e.g., gross enrollment rate).
  8. More reality check on value: GDP growth rate of more than 10% should raise a flag (check to make sure); GDP growth rate of more than 5% for a rich country; calorie per capita of more than 3000 for a poor country; calorie per capita of more than 4000?
  9. When data is in Excel, line graphs for a country is a way to quickly check for transients
  10. If you use ImportXLS do a scatter plot on most recent (or some other years) column for old and new data

Vetting Tool: Compare Imported Data Form in IFs

IFs now has a new feature to do some basic initial vetting, checking things like: are there any zeroes or negatives in the imported data, are there big spikes in the extracted data, are any of the updated data value wildly different from existing data.

This feature (screen) can also be used to merge year columns between updated and existing Access tables, a blending requirement.

The feature can be reached through IFs menu options:

IFs -> MainMenu->Extended Features -> Manage Country Data->Historic Data File Processing-> Compare Imported Data

The form (screen) that opens is shown in the figure below. The process starts by opening the Access (.Mdb) file containing data series tables and a dictionary table with a list of new and/or updated data variables. "File" menu option in this form can be used to open a .mdb file, e.g. IFsDataImport.mdb or IFsDataImportBatch.mdb or any such import files passed from data extraction and renamed for the sake of filing convenience.

As long as the opened .mdb contains a bunch of data tables and a datadict with a list of those tables, the top grid will fill up with the list from the datadict. The second column in the top grid will be filled using master datadict and the cells in this column will remain empty for new extractions.

Once you double click on any of the rows in the top grid, data for the new and the existing table (for updates) will show up in the middle grids. Any zero or negative will be marked red. You can use buttons below the grids to clear out zeroes and/or negatives from the table containing extracted data.


The button titled Mark Transients marks any spike found across columns (for the same country in the same table). The marking algorithm uses a default threshold of 10% (up or down) that can be changed by using the Marking Threshold box.

Another button titled Mark Inconsistencies mark any big discrepancy found between same data points (same country, same year) in the two tables, imported and existing. This marking algorithm also uses a default threshold of 10% (up or down) that can be changed by using the Marking Threshold box.

You should use the Refresh Display button as you switch from one kind (or threshold) of marking to another.

"Bring Years From Existing Table to New Table" button serves the blending purpose of merging complete year columns that are missing in the newly imported table (this button can, in fact, should be used post extraction prior to vetting).

"Update Data Dictionary for Imported Table" button on the bottom left takes to a data dictionary form and allows you to vet/edit the data dictionary info filled during data extraction.

The big button on the bottom right titled "Merge Imported Data to Master Database" should not be used during this stage of vetting. This button adds (or overwrites) imported tables to IFs\Data\IFsHistSeries.Mdb and merges datadict rows to the master copy of datadict (IFs\Data\Datadict.mdb). This button should only be used by the person who is in charge of merging the data to IFs database.