Short Summary

When reading data into R, ISCO variables tend to lose information when read as numbers, because leading zeros are dropped. Be sure to read ISCO variables as character vectors instead of numeric ones. For CSV files you can do that like this (remember to replace the ISCO column names with the ones in your data):

library(readr)
library(haven)

# For CSV, specify `col_types`
isco_ready <- read_csv(isco_csv, col_types = list(ISCO08 = col_character(), ISCO88 = col_character()))

If you’re reading data using read_dta for Stata or read_spss for SPSS, there’s no workaround, since neither function lets you control how each column is read. Read the data directly and then pass the ISCO variable through repair_isco to detect and fix any issues.
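As a minimal sketch of that workflow, the snippet below writes a small, made-up Stata file to a temporary path so that it is self-contained, then reads it back and repairs the ISCO column. The data values and the `stata_data` name are hypothetical, and the `as.numeric()` call is only a precaution that drops any Stata attributes before the repair.

```r
library(haven)
library(DIGCLASS)

# Hypothetical example data: a numeric ISCO column, as Stata often stores it
tmp <- tempfile(fileext = ".dta")
write_dta(data.frame(ISCO08 = c(110, 1000, 1111)), tmp)

# Read the file directly; read_dta gives no per-column type control
stata_data <- read_dta(tmp)

# Pass the column through repair_isco, which reports and fixes any issues
stata_data$ISCO08 <- repair_isco(as.numeric(stata_data$ISCO08))
```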

For example, after reading your data and running repair_isco on your ISCO variable, you’ll see a summary of all problems, if there are any:

x <- repair_isco(isco_ready$ISCO08)
#> ! ISCO variable is not a character. Beware that numeric ISCO variables possibly contain lost data. See https://cimentadaj.github.io/DIGCLASS/articles/repairing_isco_input.html for more details. Converting to a character vector.
#> ℹ ISCO variable has occupations with digits less than 4. Converting to 4 digits.
#> • Converted `110` to `0110`
#> • Converted `210` to `0210`
#> • Converted `310` to `0310`
#> • Converted `100` to `0100`
#> • Converted `200` to `0200`
#> • Converted `300` to `0300`

Repairing ISCO variables

ISCO variables can become corrupted when being read into R. Let me show you a typical example. Suppose we have a CSV file with data on ISCO occupations:

library(DIGCLASS)
library(readr)

isco_csv <- "ISCO08,ISCO88
0110,0110
0210,0110
0310,0110
0100,0110
0200,0110
0300,0110
1000,1000
1100,1100
1110,1100
1111,1110"

This CSV data is stored in a string in R and we need to read it with read_csv. It is quite common for ISCO columns to be read as numeric when using functions such as read_csv or read_dta (for Stata files). Let’s see what happens when we read the columns as numeric:

isco_ready <- read_csv(isco_csv, col_types = list(.default = col_double()))
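The same loss can be reproduced in base R without reading any file. This is only an illustrative sketch with made-up codes: once a code has passed through a numeric type, converting it back to text does not recover the leading zero.

```r
# Hypothetical ISCO codes stored as text, leading zeros intact
codes <- c("0110", "0100", "1100")

# Converting to numeric silently drops the leading zeros
as_numbers <- as.numeric(codes)

# Converting back to character does not restore them
round_trip <- as.character(as_numbers)
round_trip
#> [1] "110"  "100"  "1100"
```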

The first 6 rows lost their leading 0. This is dangerous because we can no longer tell whether the occupation 110 was originally 0110 or 1100. Since we can’t check every single occupation in a dataset by hand to confirm that each digit was preserved, DIGCLASS provides the function repair_isco, which reports and corrects such issues. Let’s run it:

isco_ready$ISCO08 <- repair_isco(isco_ready$ISCO08)
#> ! ISCO variable is not a character. Beware that numeric ISCO variables possibly contain lost data. See https://cimentadaj.github.io/DIGCLASS/articles/repairing_isco_input.html for more details. Converting to a character vector.
#> ℹ ISCO variable has occupations with digits less than 4. Converting to 4 digits.
#> • Converted `110` to `0110`
#> • Converted `210` to `0210`
#> • Converted `310` to `0310`
#> • Converted `100` to `0100`
#> • Converted `200` to `0200`
#> • Converted `300` to `0300`

The function tells us several things:

  • It warns that the column is numeric and converts it to a character vector.
  • It warns that some occupations have fewer than 4 digits. If your input ISCO variable has 3, 2 or 1 digits, you can change the expected length by setting the digits argument to another number.
  • It shows each transformation it made to harmonize the occupations. Every transformation pads the code with 0’s on the left until it reaches the total number of digits.
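The left-padding described above can be imitated in base R. This is a sketch of the idea only, not repair_isco’s actual implementation; the `codes` and `n_digits` names are made up for illustration.

```r
# Hypothetical numeric codes as they might look after a lossy read
codes <- c(110, 100, 1000, 1111)

# Target number of digits (4 for full ISCO codes)
n_digits <- 4

# Left-pad with zeros up to n_digits
padded <- sprintf(paste0("%0", n_digits, "d"), as.integer(codes))
padded
#> [1] "0110" "0100" "1000" "1111"
```

Note that padding assumes the lost digits were leading zeros, which is why reading the column as character in the first place is the safer option.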

Ideally, you want to make sure you always read the data as characters because that way we can preserve the original coding of the data. You can read data as characters like this for a CSV:

isco_ready <- read_csv(isco_csv, col_types = list(ISCO08 = col_character(), ISCO88 = col_character()))
isco_ready$ISCO08 <- repair_isco(isco_ready$ISCO08)

Note that no messages were printed this time. That’s because the column was read correctly and nothing needs to be repaired.
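If you want an explicit check in your own scripts, a quick assertion on the code lengths is one option. The sketch below uses a shortened, made-up CSV string and the readr shorthand `cols(.default = col_character())`, which reads every column as character; adapt the column names to your data.

```r
library(readr)

# Hypothetical two-row CSV with leading zeros
small_csv <- "ISCO08,ISCO88
0110,0110
1000,1000"

# Read every column as character so no digits are lost
small_ready <- read_csv(small_csv, col_types = cols(.default = col_character()))

# Stop early if any code does not have exactly 4 digits
stopifnot(all(nchar(small_ready$ISCO08) == 4))
```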