Repairing an ISCO variable
repairing_isco_input.Rmd
Short Summary
When reading data into R, ISCO variables tend to loose information when read as a number. Be sure to specify that ISCO variables be read as character vectors instead of numeric ones. You can do that for CSV files and Stata formats like this (remember to change the ISCO column names for the ones in your data here):
library(readr)
library(haven)
# For CSV, specify `col_types`
isco_ready <- read_csv(isco_csv, col_types = list(ISCO08 = col_character(), ISCO88 = col_character()))
If you’re reading data using read_dta
for Stata or
read_spss
for SPSS, there’s no workaround since both
functions do not allow to control how each column is read. You should
read it directly and then pass it through repair_isco
to
evaluate any issues.
For example, after reading your data and running
repair_isco
on your ISCO variable, you’ll see a summary of
all problems, if there are any:
x <- repair_isco(isco_ready$ISCO08)
#> ! ISCO variable is not a character. Beware that numeric ISCO variables possibly contain lost data. See https://cimentadaj.github.io/DIGCLASS/articles/repairing_isco_input.html for more details. Converting to a character vector.
#> ℹ ISCO variable has occupations with digits less than 4. Converting to 4 digits.
#> • Converted `110` to `0110`
#> • Converted `210` to `0210`
#> • Converted `310` to `0310`
#> • Converted `100` to `0100`
#> • Converted `200` to `0200`
#> • Converted `300` to `0300`
Repairing ISCO variables
ISCO variables can become corrupted when being read into R. Let me show you a typical example. Suppose we have a CSV file with data on ISCO occupations:
library(DIGCLASS)
library(readr)
isco_csv <- "ISCO08,ISCO88
0110,0110
0210,0110
0310,0110
0100,0110
0200,0110
0300,0110
1000,1000
1100,1100
1110,1100
1111,1110"
This CSV data is in a string in R and we need to read it with
read_csv
. It is quite common for these ISCO columns to be
read as a numeric column when using functions such as
read_csv
or read_dta
(for Stata files). Let’s
see what happens when we read a column as a numeric:
isco_ready <- read_csv(isco_csv, col_types = list(.default = col_double()))
The first 6 rows lost the first 0
. This is quite
dangerous because right now we don’t know if the occupation
110
is actually 0110
or 1100
.
Since we can’t check every single occupation in our dataset to confirm
that it kept all the information of each digit, DIGCLASS
has the function repair_isco
which tells corrects any
issues. Let’s run it:
isco_ready$ISCO08 <- repair_isco(isco_ready$ISCO08)
#> ! ISCO variable is not a character. Beware that numeric ISCO variables possibly contain lost data. See https://cimentadaj.github.io/DIGCLASS/articles/repairing_isco_input.html for more details. Converting to a character vector.
#> ℹ ISCO variable has occupations with digits less than 4. Converting to 4 digits.
#> • Converted `110` to `0110`
#> • Converted `210` to `0210`
#> • Converted `310` to `0310`
#> • Converted `100` to `0100`
#> • Converted `200` to `0200`
#> • Converted `300` to `0300`
The functions tells us some things:
- If the column is a numeric, and warns that we need to turn it into a character
- Warns that there are some occupations with digit less than 4. If
your input ISCO variable is of 3, 2 or 1 digit, you can change it using
the
digits
argument set to another number. - Shows the transformation it made to harmonize all occupations. All transformations will be appending 0’s from the left to the total number of digits.
Ideally, you want to make sure you always read the data as characters because that way we can preserve the original coding of the data. You can read data as characters like this for a CSV:
isco_ready <- read_csv(isco_csv, col_types = list(ISCO08 = col_character(), ISCO88 = col_character()))
isco_ready$ISCO08 <- repair_isco(isco_ready$ISCO08)
Note that no messages were printed now. That’s because the function is correct and nothing needs to be done.