In the midst of a historic uprooting of the job market, questions related to salary and fair pay have never been more relevant or more prominent in daily conversation. “The Great Resignation” associated with the COVID-19 pandemic has emboldened the American workforce to shed the outdated taboo of keeping one’s salary private and to demand more from employers.
Consideration of changing demographics in the labor market is crucial to understanding the current cultural shift in pay transparency. As women continue to outnumber men in college degrees awarded and enter the workforce in greater numbers, more light is shed on the gender pay gap in white-collar industries. Similarly, the racial and generational makeup of the office workplace is rapidly moving away from the middle-aged, white “standard” of the 20th century. As historically disadvantaged demographic groups rise to greater prominence in historically white male workplaces, urgency in uncovering pay disparities increases.
The tech industry is an obvious target to examine such disparities. Highly desirable, high-paying jobs that tout benefits like stock options, education reimbursement and the ability to work remotely attract the best and brightest to the field, regardless of race, gender, or age. Of course the modern tech industry has its own historical demographic disparities, skewing heavily toward White and Asian men. However, the high demand for skilled tech workers and increasing levels of interest and access to tech education across all genders and races is rapidly shifting the face of the industry.
Data from the U.S. Bureau of Labor Statistics highlights shifts in the racial makeup of Computer and Mathematical Professions in recent years. Since 2018, the yearly percent increase in Black employed persons in the field has more than doubled from 4% to 9%, while both White workers and Asian workers have seen a decline in growth. Racial groups that have historically been excluded from the tech field are gaining momentum in representation as the industry continues to expand.
In their 2021 Gender and Pay Gap Report, which analyzes pay disparities in the US across all industries via crowdsourced data, PayScale found that, without adding any control variables, women make 82¢ for every dollar earned by men. After adding control variables, women made 98¢ for every dollar earned by men, leaving a 2% difference attributable purely to discrimination based on gender.
PayScale’s findings on the racial wage gap show that, with or without control of demographics, both men and women of most races earn less than white men. Interestingly, when controlling for external factors, Asian men and women earn more than any group. The figures below taken from payscale.com illustrate these pay disparities:
Levels.fyi is a website founded in 2017 as a place for tech industry professionals around the world to anonymously share detailed compensation information. In 2020, levels.fyi began collecting race, gender, and education information from users along with salary information.
Using data from levels.fyi, I am interested in answering the question: can any of the variance in compensation in the tech industry be explained by racial and gender differences? If so, how much of this variance can be attributed to differences in years of experience, job title, educational attainment, and cost of living between genders and racial groups?
My hypothesis going into this analysis is that race and gender will remain significant predictors of total annual compensation when controlling for years of experience, education, and cost of living. Based on the data from PayScale, I predict that men in the levels.fyi dataset will be compensated on average approximately 2% more than women after adding controls. I predict that Asian posters will be compensated on average approximately 2% more than White posters, while Black posters will be compensated on average approximately 6% less than White posters, and Hispanic posters approximately 4% less than White posters.
My dependent variable will be total annual compensation, with gender, race, education, total years of experience, years at the current company and cost of living index as independent variables.
The sample that I will use comes from a comprehensive dataset of scraped salary postings from levels.fyi. The data was posted to Kaggle.com and is available here. I have limited my analysis to jobs in the US and removed NA values for our target independent variables, gender and race. I have also removed records with total yearly compensation equal to 0. I have joined this data to a separate table with cost of living index values by US state.
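A minimal sketch of these cleaning steps, written in Python purely for illustration and not the code used for this analysis. The `country` and `state` field names and every value below are invented placeholders; only `gender`, `Race`, `totalyearlycompensation` and `Index` mirror names used in this write-up.

```python
# Illustrative records; "country" and "state" are assumed field names,
# and all values are invented for the example.
postings = [
    {"country": "US", "state": "CA", "gender": "Female", "Race": "Asian", "totalyearlycompensation": 250000},
    {"country": "US", "state": "TX", "gender": None, "Race": "White", "totalyearlycompensation": 180000},
    {"country": "US", "state": "WA", "gender": "Male", "Race": "Black", "totalyearlycompensation": 0},
    {"country": "IN", "state": None, "gender": "Male", "Race": "Asian", "totalyearlycompensation": 60000},
    {"country": "US", "state": "TX", "gender": "Male", "Race": "Hispanic", "totalyearlycompensation": 150000},
]

# Hypothetical cost of living index by state (US average = 100).
col_index = {"CA": 140.0, "TX": 92.0, "WA": 112.0}


def clean_and_join(rows, index_by_state):
    """Keep US rows with known gender/race and nonzero pay, then attach Index."""
    cleaned = []
    for row in rows:
        if row["country"] != "US":
            continue  # analysis is limited to US jobs
        if row["gender"] is None or row["Race"] is None:
            continue  # drop records missing the target variables
        if row["totalyearlycompensation"] == 0:
            continue  # drop records reporting zero compensation
        if row["state"] not in index_by_state:
            continue  # no index value to join on
        cleaned.append({**row, "Index": index_by_state[row["state"]]})
    return cleaned


sample = clean_and_join(postings, col_index)
print(len(sample), "records kept")
```

Dropping rows with no matching state mimics an inner join; whether the original join kept or dropped unmatched rows is not stated, so treat that choice as an assumption.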
Total Yearly Compensation (totalyearlycompensation): Total yearly compensation in US dollars; sum of base salary, stock grant value and bonus. Numeric.
Base Salary (basesalary): Base salary in US dollars. Values of $0 have been removed. Numeric.
Stock Grant Value (stockgrantvalue): Stock grant value in US dollars. Numeric.
Bonus (bonus): Bonus in US dollars. Numeric.
Gender (gender): Gender can be “Male”, “Female” or “Other”. NA values have been removed. Nominal.
Race (Race): Race can be “Asian”, “Black”, “Hispanic”, “Two or More”, or “White”. NA values have been removed. Nominal.
Years of Experience (yearsofexperience): Total years of work experience. Integer.
Years at Company (yearsatcompany): Years working at the current company. Integer.
Education (Education): Education can be “Highschool”, “Some College”, “Bachelor’s Degree”, “Master’s Degree” or “PhD”. NA values have been removed. Nominal.
Year (Year): All records in our sample were posted in 2020 or 2021. Nominal.
Title (title): Job title can be “Business Analyst”, “Data Scientist”, “Hardware Engineer”, “Human Resources”, “Management Consultant”, “Marketing”, “Mechanical Engineer”, “Product Designer”, “Product Manager”, “Recruiter”, “Sales”, “Software Engineer”, “Software Engineering Manager”, “Solution Architect”, and “Technical Program Manager”. Nominal.
Cost of Living Index (Index): Cost of living index by US State comes from The Council for Community and Economic Research from Q3 of 2021. The average cost of living index for the US is 100. States with cost of living greater than 100 have above average cost of living, those below 100 have below average cost of living. Numeric.
Variable | n | mean | sd | median | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|
yearsofexperience | 16961 | 7.214197 | 6.052711 | 5.0 | 0.0 | 45.0 | 45.0 | 1.2589765 | 1.701493 | 0.0464755 |
yearsatcompany | 16961 | 2.796592 | 3.473673 | 2.0 | 0.0 | 40.0 | 40.0 | 2.4861424 | 8.955334 | 0.0266725 |
totalyearlycompensation | 16961 | 222916.1 | 134028.0 | 193000.0 | 15000.0 | 4980000.0 | 4965000.0 | 6.7060609 | 166.234111 | 1029.1288232 |
basesalary | 16961 | 149710.1 | 49481.35 | 145000.0 | 10000.0 | 900000.0 | 890000.0 | 2.9145908 | 25.947537 | 379.9407214 |
stockgrantvalue | 16961 | 51641.24 | 78177.94 | 25000.0 | 0.0 | 954000.0 | 954000.0 | 3.4566399 | 19.170630 | 600.2863735 |
bonus | 16961 | 20683.75 | 26585.60 | 15000.0 | 0.0 | 900000.0 | 900000.0 | 6.9435817 | 125.890981 | 204.1365625 |
Index | 16961 | 126.5840 | 21.63128 | 132.5 | 85.1 | 185.6 | 100.5 | -0.3916221 | -1.454499 | 0.1660950 |
The average number of years of experience out of 16,961 posters in the US is 7.2 years. The skew and kurtosis of this variable are relatively low, meaning that it is approximately normally distributed. The average number of years at the poster’s current company is 2.8. The skew and kurtosis for this variable are higher than they were for total years of experience - this is acceptable, as normality of independent variables is not an assumption for multiple linear regression.
The average total annual compensation for our sample is $222,916, average base salary is $149,710, average stock grant value is $51,641 and average bonus is $20,684. Total annual compensation, base salary, stock grant value and bonus are all heavily skewed in our dataset. Again, this is acceptable. Log transforming Total Yearly Compensation in our models will also increase the normality of this variable.
The average cost of living index for our sample is 127 and the median is 132.5, meaning that most of the posters to levels.fyi in our sample live in states with a higher than average cost of living. Skew and kurtosis for this variable are relatively low in magnitude, meaning that it is approximately normally distributed.
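To make the skew and kurtosis figures more concrete, here is a small Python sketch of moment-based skewness and excess kurtosis, and of how a log transform pulls in a right-skewed variable. These are the simple population-moment formulas, which may differ slightly from the estimator behind the table above, and the compensation values are invented.

```python
import math


def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Third standardized moment; 0 for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Invented right-skewed pay values with one very large outlier.
pay = [90000, 120000, 150000, 180000, 220000, 300000, 1200000]
log_pay = [math.log(v) for v in pay]

print("skew of raw pay:", round(skewness(pay), 2))
print("skew of ln(pay):", round(skewness(log_pay), 2))
```

On data shaped like this the log-transformed values are noticeably less skewed than the raw ones, which is the motivation for modelling ln(compensation).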
usa_level_col$education | n | percent |
---|---|---|
Bachelor’s Degree | 8231 | 48.5% |
Highschool | 193 | 1.1% |
Master’s Degree | 7428 | 43.8% |
PhD | 827 | 4.9% |
Some College | 282 | 1.7% |
A Bachelor’s degree is the most common level of Education in our sample, followed very closely by Master’s degree. PhDs account for about 5% of the sample. There are very few cases with education below the level of a Bachelor’s degree.
usa_level_col$gender | n | percent |
---|---|---|
Female | 3379 | 19.9% |
Male | 13497 | 79.6% |
Other | 85 | 0.5% |
The majority of cases in our sample identify as Male.
usa_level_col$race | n | percent |
---|---|---|
Asian | 8682 | 51.2% |
Black | 595 | 3.5% |
Hispanic | 908 | 5.4% |
Two Or More | 647 | 3.8% |
White | 6129 | 36.1% |
The most common races in our sample are Asian and White, with Black, Hispanic and “Two or More” accounting for less than 13% of cases.
usa_level_col$year | n | percent |
---|---|---|
2020 | 6369 | 37.6% |
2021 | 10592 | 62.4% |
Most of the data in our sample comes from 2021.
usa_level_col$title | n | percent |
---|---|---|
Business Analyst | 361 | 2.1% |
Data Scientist | 717 | 4.2% |
Hardware Engineer | 656 | 3.9% |
Human Resources | 151 | 0.9% |
Management Consultant | 368 | 2.2% |
Marketing | 323 | 1.9% |
Mechanical Engineer | 234 | 1.4% |
Product Designer | 518 | 3.1% |
Product Manager | 1254 | 7.4% |
Recruiter | 190 | 1.1% |
Sales | 155 | 0.9% |
Software Engineer | 10372 | 61.2% |
Software Engineering Manager | 803 | 4.7% |
Solution Architect | 302 | 1.8% |
Technical Program Manager | 557 | 3.3% |
Software Engineer is by far the most common job title in our sample, accounting for over 60% of cases. Product Manager is the next most common job title, accounting for 7.4% of cases.
As an initial step, I will run the following four linear regressions:
Model 1a: ln(Total Annual Compensation) = β0 + β1(Gender)
Model 1b: ln(Total Annual Compensation) = β0 + β2(Race)
Model 1c: ln(Total Annual Compensation) = β0 + β1(Gender) + β2(Race)
Model 1d: ln(Total Annual Compensation) = β0 + β1(Gender) + β2(Race) + β3(Gender × Race)
Note 1: Reference groups for the gender and race variables have been set to “Male” and “White”.
Note 2: The dependent variable in the model is the natural log of total annual compensation. Log transforming variables with wide ranges such as income is a standard practice and will allow easier interpretation of model coefficients.
Note 3: Gender and Race have been written as single variables for simplicity, however because they are categorical variables, they will be dummy-coded in the model and will be represented by several beta values (one for each unique value, except for the reference group).
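As a rough illustration of Note 3, dummy coding turns one categorical variable into a set of 0/1 indicator columns, one per level other than the reference group. The helper below is a hypothetical Python sketch, not the routine the modelling software uses internally.

```python
def dummy_code(values, levels, reference):
    """Return one 0/1 column per non-reference level."""
    columns = {}
    for level in levels:
        if level == reference:
            continue  # the reference group is absorbed into the intercept
        columns[level] = [1 if v == level else 0 for v in values]
    return columns


race_levels = ["Asian", "Black", "Hispanic", "Two Or More", "White"]
races = ["White", "Asian", "Black", "Asian"]  # invented example rows

coded = dummy_code(races, race_levels, reference="White")
for level, column in coded.items():
    print(f"Race [{level}]:", column)
```

A White poster is the row where every indicator is 0, which is why each race coefficient in the tables reads as a difference from White posters.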
Predictors | Model 1a Estimates | Model 1a CI | Model 1a p | Model 1b Estimates | Model 1b CI | Model 1b p | Model 1c Estimates | Model 1c CI | Model 1c p | Model 1d Estimates | Model 1d CI | Model 1d p |
---|---|---|---|---|---|---|---|---|---|---|---|---|
(Intercept) | 12.21 | 12.20 – 12.22 | <0.001 | 12.18 | 12.17 – 12.20 | <0.001 | 12.20 | 12.19 – 12.22 | <0.001 | 12.20 | 12.19 – 12.22 | <0.001 |
gender [Female] | -0.11 | -0.13 – -0.09 | <0.001 | | | | -0.11 | -0.13 – -0.09 | <0.001 | -0.11 | -0.14 – -0.08 | <0.001 |
gender [Other] | -0.01 | -0.11 – 0.10 | 0.886 | | | | 0.01 | -0.09 – 0.11 | 0.862 | -0.19 | -0.35 – -0.03 | 0.020 |
Race [Asian] | | | | 0.03 | 0.02 – 0.05 | <0.001 | 0.04 | 0.02 – 0.05 | <0.001 | 0.04 | 0.02 – 0.06 | <0.001 |
Race [Black] | | | | -0.18 | -0.22 – -0.14 | <0.001 | -0.17 | -0.21 – -0.13 | <0.001 | -0.18 | -0.23 – -0.13 | <0.001 |
Race [Hispanic] | | | | -0.07 | -0.10 – -0.04 | <0.001 | -0.07 | -0.11 – -0.04 | <0.001 | -0.08 | -0.12 – -0.04 | <0.001 |
Race [Two Or More] | | | | -0.01 | -0.05 – 0.03 | 0.533 | -0.01 | -0.05 – 0.03 | 0.616 | -0.05 | -0.09 – -0.00 | 0.042 |
gender [Female] * Race [Asian] | | | | | | | | | | -0.02 | -0.06 – 0.02 | 0.347 |
gender [Other] * Race [Asian] | | | | | | | | | | 0.16 | -0.18 – 0.50 | 0.349 |
gender [Female] * Race [Black] | | | | | | | | | | 0.03 | -0.06 – 0.13 | 0.465 |
gender [Other] * Race [Black] | | | | | | | | | | 0.47 | -0.22 – 1.16 | 0.181 |
gender [Female] * Race [Hispanic] | | | | | | | | | | 0.01 | -0.08 – 0.10 | 0.797 |
gender [Other] * Race [Hispanic] | | | | | | | | | | 0.64 | 0.19 – 1.09 | 0.006 |
gender [Female] * Race [Two Or More] | | | | | | | | | | 0.13 | 0.03 – 0.22 | 0.010 |
gender [Other] * Race [Two Or More] | | | | | | | | | | 0.37 | 0.14 – 0.60 | 0.002 |
Observations | 16961 | | | 16961 | | | 16961 | | | 16961 | | |
R2 / R2 adjusted | 0.008 / 0.008 | | | 0.008 / 0.008 | | | 0.016 / 0.016 | | | 0.018 / 0.017 | | |
The preliminary models use only our target independent variables, gender and race, to predict total annual compensation. Model 1a (gender) has an R^2 value of 0.008, meaning that 0.8% of the variance in total annual compensation can be explained by differences in gender. Model 1b (race) also has an R^2 value of 0.008, meaning that 0.8% of the variance in total annual compensation can be explained by differences in race. Model 1c (gender and race) has an R^2 value of 0.016, meaning that 1.6% of the variance in total annual compensation can be explained by differences in gender and race among posters. Although there are minor changes in coefficient values for Race [Asian] and Race [Black] between models 1b and 1c, because the R^2 value of model 1c equals the combined R^2 values of models 1a and 1b (to three decimal places), we can infer that race and gender are both contributing their own explanatory power to the combined model, and neither variable is mediating the effect of the other in a significant way.
In Model 1c, we can see that net of gender, Black and Hispanic posters had a total yearly compensation on average 17% and 7% less than White posters, respectively. Asian posters had on average 4% higher yearly compensation than White posters. These relationships are all highly significant, with p values below the .001 threshold.
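Reading a log-scale coefficient directly as a percentage is an approximation that works well for small values. A short Python sketch of the exact conversion, exp(β) − 1, applied to the Model 1c estimates from the table above (the function name is my own):

```python
import math


def pct_change(beta):
    """Exact percent difference implied by a coefficient on ln(compensation)."""
    return (math.exp(beta) - 1.0) * 100.0


# Model 1c estimates as reported in the regression table.
model_1c = {
    "gender [Female]": -0.11,
    "Race [Asian]": 0.04,
    "Race [Black]": -0.17,
    "Race [Hispanic]": -0.07,
}

for term, beta in model_1c.items():
    print(f"{term}: beta = {beta:+.2f}, exact change = {pct_change(beta):+.1f}%")
```

For coefficients this small the two readings differ by a point or two at most; the gap widens as coefficients grow, so larger estimates are better reported after exponentiating.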
Model 1d was included to test for a possible interaction between the gender and race variables. Adding the interaction term increased the model’s explanatory power by only 0.2% (unadjusted) or 0.1% (adjusted). Groups that differed significantly from White men to an extent beyond what can be explained by either the race or gender variables separately were Hispanic posters of gender “Other”, female posters of race “Two or More”, and posters of gender “Other” and race “Two or More.” All coefficients for these groups were positive, meaning that membership in one of these combinations of groups increases total annual compensation beyond what would be expected using gender and race separately as predictors. Because none of these coefficients were significant at the p < .001 threshold and they add relatively little explanatory power to the model, I am choosing to omit an interaction term between gender and race in subsequent models.
Without controlling for additional variables, we cannot say with much certainty what is driving these relationships. It could be that the men in our dataset have much more experience than the women in our dataset, and after controlling for years of experience, the magnitude of the coefficient for gender [Female] will decrease. To account for some of these possibilities, I will add the following control variables to our model: Years of Experience, Years at Company, Education, Title, Cost of Living Index.
Next I will perform a hierarchical linear regression to determine the statistical significance of variance in compensation explained by gender and race. The first step will include only control variables as independent variables to predict total annual compensation. In subsequent models, gender and race will be added as independent variables to the initial model. F-tests between the models will test for the significance of any additional variance explained as gender and race are added.
The models I will test are written below:
Model 2a: ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index)
Model 2b: ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index) + β6(Gender)
Model 2c: ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index) + β6(Gender) + β7(Race)
Predictors | Model 2a Estimates | Model 2a CI | Model 2a p |
---|---|---|---|
(Intercept) | 10.63 | 10.58 – 10.68 | <0.001 |
yearsofexperience | 0.04 | 0.04 – 0.04 | <0.001 |
yearsatcompany | -0.01 | -0.02 – -0.01 | <0.001 |
education [Highschool] | 0.01 | -0.04 – 0.06 | 0.744 |
education [Master’s Degree] | 0.08 | 0.07 – 0.09 | <0.001 |
education [PhD] | 0.34 | 0.31 – 0.37 | <0.001 |
education [Some College] | -0.09 | -0.13 – -0.04 | <0.001 |
title [Data Scientist] | 0.35 | 0.30 – 0.40 | <0.001 |
title [Hardware Engineer] | 0.27 | 0.22 – 0.32 | <0.001 |
title [Human Resources] | 0.10 | 0.03 – 0.17 | 0.008 |
title [Management Consultant] | 0.12 | 0.07 – 0.18 | <0.001 |
title [Marketing] | 0.20 | 0.15 – 0.26 | <0.001 |
title [Mechanical Engineer] | 0.09 | 0.03 – 0.15 | 0.004 |
title [Product Designer] | 0.35 | 0.30 – 0.40 | <0.001 |
title [Product Manager] | 0.45 | 0.40 – 0.49 | <0.001 |
title [Recruiter] | 0.07 | 0.01 – 0.14 | 0.031 |
title [Sales] | 0.33 | 0.26 – 0.40 | <0.001 |
title [Software Engineer] | 0.40 | 0.36 – 0.44 | <0.001 |
title [Software Engineering Manager] | 0.63 | 0.58 – 0.67 | <0.001 |
title [Solution Architect] | 0.31 | 0.25 – 0.37 | <0.001 |
title [Technical Program Manager] | 0.34 | 0.29 – 0.39 | <0.001 |
Index | 0.01 | 0.01 – 0.01 | <0.001 |
Observations | 16961 | | |
R2 / R2 adjusted | 0.411 / 0.410 | | |
The first step in the hierarchical linear model with control variables Years of Experience, Years at Company, Education, Title, and Index explains 41% of the variance in total annual compensation in our sample. This is a large amount of explanatory power, and almost all of the coefficients associated with the variables are significant with a p value below the .001 threshold.
As could be predicted, net of other variables, as total years of experience increase, so does total yearly compensation, by 4% per year of experience. Interestingly, posters with a longer tenure at the current company on average see total yearly compensation for new positions decrease by 1% for every year at the company, net of other variables.
Due to their omission, we can infer that the reference group for education is Bachelor’s degree holders, and the reference group for title is Business Analyst.
Coefficients for the dummy-coded education variables make sense: net of other variables, posters with a Master’s degree are compensated on average about 8% higher than those with a Bachelor’s degree. Posters with a PhD have a coefficient of 0.34, which reads as roughly 34% higher under the usual approximation and closer to 40% once exponentiated (exp(0.34) ≈ 1.40), a whopping premium over a Bachelor’s degree either way. There is a high degree of variation in compensation between titles, which would also be expected.
Net of other variables, as the cost of living index for the state in which the job was taken increases, so does total annual compensation, by 1% per index point on average.
Next I will add gender as an independent variable to this model to determine its effect and significance in explaining total annual compensation when controlling for years of experience, years at the company, education, title, and cost of living.
Predictors | Model 2a Estimates | Model 2a CI | Model 2a p | Model 2b Estimates | Model 2b CI | Model 2b p |
---|---|---|---|---|---|---|
(Intercept) | 10.63 | 10.58 – 10.68 | <0.001 | 10.65 | 10.59 – 10.70 | <0.001 |
yearsofexperience | 0.04 | 0.04 – 0.04 | <0.001 | 0.04 | 0.04 – 0.04 | <0.001 |
yearsatcompany | -0.01 | -0.02 – -0.01 | <0.001 | -0.01 | -0.02 – -0.01 | <0.001 |
education [Highschool] | 0.01 | -0.04 – 0.06 | 0.744 | -0.00 | -0.05 – 0.05 | 0.981 |
education [Master’s Degree] | 0.08 | 0.07 – 0.09 | <0.001 | 0.08 | 0.07 – 0.09 | <0.001 |
education [PhD] | 0.34 | 0.31 – 0.37 | <0.001 | 0.34 | 0.31 – 0.37 | <0.001 |
education [Some College] | -0.09 | -0.13 – -0.04 | <0.001 | -0.09 | -0.13 – -0.04 | <0.001 |
title [Data Scientist] | 0.35 | 0.30 – 0.40 | <0.001 | 0.34 | 0.30 – 0.39 | <0.001 |
title [Hardware Engineer] | 0.27 | 0.22 – 0.32 | <0.001 | 0.26 | 0.21 – 0.30 | <0.001 |
title [Human Resources] | 0.10 | 0.03 – 0.17 | 0.008 | 0.12 | 0.05 – 0.19 | 0.001 |
title [Management Consultant] | 0.12 | 0.07 – 0.18 | <0.001 | 0.12 | 0.07 – 0.18 | <0.001 |
title [Marketing] | 0.20 | 0.15 – 0.26 | <0.001 | 0.22 | 0.16 – 0.27 | <0.001 |
title [Mechanical Engineer] | 0.09 | 0.03 – 0.15 | 0.004 | 0.08 | 0.02 – 0.14 | 0.014 |
title [Product Designer] | 0.35 | 0.30 – 0.40 | <0.001 | 0.36 | 0.31 – 0.41 | <0.001 |
title [Product Manager] | 0.45 | 0.40 – 0.49 | <0.001 | 0.45 | 0.40 – 0.49 | <0.001 |
title [Recruiter] | 0.07 | 0.01 – 0.14 | 0.031 | 0.09 | 0.03 – 0.16 | 0.006 |
title [Sales] | 0.33 | 0.26 – 0.40 | <0.001 | 0.33 | 0.26 – 0.40 | <0.001 |
title [Software Engineer] | 0.40 | 0.36 – 0.44 | <0.001 | 0.39 | 0.35 – 0.43 | <0.001 |
title [Software Engineering Manager] | 0.63 | 0.58 – 0.67 | <0.001 | 0.62 | 0.57 – 0.66 | <0.001 |
title [Solution Architect] | 0.31 | 0.25 – 0.37 | <0.001 | 0.30 | 0.24 – 0.35 | <0.001 |
title [Technical Program Manager] | 0.34 | 0.29 – 0.39 | <0.001 | 0.34 | 0.29 – 0.39 | <0.001 |
Index | 0.01 | 0.01 – 0.01 | <0.001 | 0.01 | 0.01 – 0.01 | <0.001 |
gender [Female] | | | | -0.07 | -0.08 – -0.05 | <0.001 |
gender [Other] | | | | -0.03 | -0.11 – 0.05 | 0.447 |
Observations | 16961 | | | 16961 | | |
R2 / R2 adjusted | 0.411 / 0.410 | | | 0.413 / 0.413 | | |
We can see that the coefficient for gender [Female] has decreased in magnitude from -0.11 in the preliminary model to -0.07 after adding control variables, though it is still highly significant at the same p value threshold of .001. This means that after controlling for years of experience, years at the company, education, job title and cost of living index, being female is still associated with roughly 7% lower total yearly compensation on average compared to being male. There is no significant difference in total annual compensation between posters who identified as men and those who identified as “other” gender, likely because this group was so small. It will be interesting to see what effects might come to light as more posters who identify neither as men nor women continue to report compensation on levels.fyi. The addition of gender to the model added only 0.2% of explanatory power overall.
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
16939 | 2348.140 | NA | NA | NA | NA |
16937 | 2337.514 | 2 | 10.62567 | 38.49539 | 0 |
The F test between Model 2a and Model 2b confirms that the addition of gender provided a significant amount of explanatory power to the model.
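For readers who want to see where the F value comes from, here is a hedged Python sketch of the nested-model F statistic computed from the residual sums of squares and degrees of freedom in the table above. The function name is my own, and the p value is left out because it needs an F distribution routine that the standard library does not provide.

```python
def nested_f_statistic(rss_reduced, rss_full, df_added, resid_df_full):
    """F statistic comparing a reduced model with a fuller nested model."""
    # Drop in unexplained variation per parameter added.
    explained_per_df = (rss_reduced - rss_full) / df_added
    # Remaining unexplained variation per residual degree of freedom.
    unexplained_per_df = rss_full / resid_df_full
    return explained_per_df / unexplained_per_df


# Model 2a vs. Model 2b: gender adds 2 dummy-coded parameters.
f_gender = nested_f_statistic(2348.140, 2337.514, df_added=2, resid_df_full=16937)
print(round(f_gender, 2))
```

The result comes out at roughly 38.5, in line with the table; any small difference is due to the RSS values being rounded to three decimals.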
Now I will add race as an independent variable to the model on top of gender to test if any additional variance can be explained.
Predictors | Model 2a Estimates | Model 2a CI | Model 2a p | Model 2b Estimates | Model 2b CI | Model 2b p | Model 2c Estimates | Model 2c CI | Model 2c p |
---|---|---|---|---|---|---|---|---|---|
(Intercept) | 10.63 | 10.58 – 10.68 | <0.001 | 10.65 | 10.59 – 10.70 | <0.001 | 10.65 | 10.60 – 10.70 | <0.001 |
yearsofexperience | 0.04 | 0.04 – 0.04 | <0.001 | 0.04 | 0.04 – 0.04 | <0.001 | 0.04 | 0.04 – 0.04 | <0.001 |
yearsatcompany | -0.01 | -0.02 – -0.01 | <0.001 | -0.01 | -0.02 – -0.01 | <0.001 | -0.01 | -0.02 – -0.01 | <0.001 |
education [Highschool] | 0.01 | -0.04 – 0.06 | 0.744 | -0.00 | -0.05 – 0.05 | 0.981 | 0.00 | -0.05 – 0.06 | 0.896 |
education [Master’s Degree] | 0.08 | 0.07 – 0.09 | <0.001 | 0.08 | 0.07 – 0.09 | <0.001 | 0.07 | 0.06 – 0.09 | <0.001 |
education [PhD] | 0.34 | 0.31 – 0.37 | <0.001 | 0.34 | 0.31 – 0.37 | <0.001 | 0.34 | 0.31 – 0.36 | <0.001 |
education [Some College] | -0.09 | -0.13 – -0.04 | <0.001 | -0.09 | -0.13 – -0.04 | <0.001 | -0.08 | -0.13 – -0.04 | <0.001 |
title [Data Scientist] | 0.35 | 0.30 – 0.40 | <0.001 | 0.34 | 0.30 – 0.39 | <0.001 | 0.34 | 0.29 – 0.39 | <0.001 |
title [Hardware Engineer] | 0.27 | 0.22 – 0.32 | <0.001 | 0.26 | 0.21 – 0.30 | <0.001 | 0.25 | 0.20 – 0.30 | <0.001 |
title [Human Resources] | 0.10 | 0.03 – 0.17 | 0.008 | 0.12 | 0.05 – 0.19 | 0.001 | 0.13 | 0.06 – 0.20 | <0.001 |
title [Management Consultant] | 0.12 | 0.07 – 0.18 | <0.001 | 0.12 | 0.07 – 0.18 | <0.001 | 0.13 | 0.07 – 0.18 | <0.001 |
title [Marketing] | 0.20 | 0.15 – 0.26 | <0.001 | 0.22 | 0.16 – 0.27 | <0.001 | 0.22 | 0.16 – 0.28 | <0.001 |
title [Mechanical Engineer] | 0.09 | 0.03 – 0.15 | 0.004 | 0.08 | 0.02 – 0.14 | 0.014 | 0.08 | 0.01 – 0.14 | 0.016 |
title [Product Designer] | 0.35 | 0.30 – 0.40 | <0.001 | 0.36 | 0.31 – 0.41 | <0.001 | 0.36 | 0.31 – 0.41 | <0.001 |
title [Product Manager] | 0.45 | 0.40 – 0.49 | <0.001 | 0.45 | 0.40 – 0.49 | <0.001 | 0.45 | 0.40 – 0.49 | <0.001 |
title [Recruiter] | 0.07 | 0.01 – 0.14 | 0.031 | 0.09 | 0.03 – 0.16 | 0.006 | 0.10 | 0.03 – 0.16 | 0.004 |
title [Sales] | 0.33 | 0.26 – 0.40 | <0.001 | 0.33 | 0.26 – 0.40 | <0.001 | 0.33 | 0.26 – 0.40 | <0.001 |
title [Software Engineer] | 0.40 | 0.36 – 0.44 | <0.001 | 0.39 | 0.35 – 0.43 | <0.001 | 0.39 | 0.35 – 0.43 | <0.001 |
title [Software Engineering Manager] | 0.63 | 0.58 – 0.67 | <0.001 | 0.62 | 0.57 – 0.66 | <0.001 | 0.61 | 0.57 – 0.66 | <0.001 |
title [Solution Architect] | 0.31 | 0.25 – 0.37 | <0.001 | 0.30 | 0.24 – 0.35 | <0.001 | 0.30 | 0.24 – 0.35 | <0.001 |
title [Technical Program Manager] | 0.34 | 0.29 – 0.39 | <0.001 | 0.34 | 0.29 – 0.39 | <0.001 | 0.33 | 0.29 – 0.38 | <0.001 |
Index | 0.01 | 0.01 – 0.01 | <0.001 | 0.01 | 0.01 – 0.01 | <0.001 | 0.01 | 0.01 – 0.01 | <0.001 |
gender [Female] | | | | -0.07 | -0.08 – -0.05 | <0.001 | -0.07 | -0.08 – -0.05 | <0.001 |
gender [Other] | | | | -0.03 | -0.11 – 0.05 | 0.447 | -0.04 | -0.12 – 0.04 | 0.378 |
Race [Asian] | | | | | | | 0.02 | 0.00 – 0.03 | 0.007 |
Race [Black] | | | | | | | -0.06 | -0.09 – -0.03 | <0.001 |
Race [Hispanic] | | | | | | | -0.02 | -0.04 – 0.01 | 0.156 |
Race [Two Or More] | | | | | | | 0.03 | -0.00 – 0.06 | 0.065 |
Observations | 16961 | | | 16961 | | | 16961 | | |
R2 / R2 adjusted | 0.411 / 0.410 | | | 0.413 / 0.413 | | | 0.414 / 0.414 | | |
After controlling for years of experience, years at the company, education, job title, cost of living index and gender, the only group that differs significantly from White posters in total annual compensation at the p < .001 threshold is Black posters, who are compensated on average 6% less annually. The difference in total annual compensation between White and Asian posters was significant at the p < .01 threshold, with Asian posters making on average 2% more annually than White posters. Though these effects were significant, the addition of race added only 0.1% more explanatory power to the overall model.
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
16937 | 2337.514 | NA | NA | NA | NA |
16933 | 2333.170 | 4 | 4.344585 | 7.882717 | 2.4e-06 |
The F test between Model 2b and Model 2c confirms that the addition of race provided a significant amount of explanatory power to the model.
The QQ plot and histogram of residuals for Model 2c show approximately normal distribution.
The direction of all coefficients in the final model supports the hypothesis outlined earlier in this paper and agrees with PayScale’s 2021 analysis. By and large, the magnitude of the coefficients for race variables agreed with PayScale’s analysis as well. One notable difference between this analysis and PayScale’s is the magnitude of the coefficient for Gender [Female]. In our controlled model (Model 2c), the coefficient for Gender [Female] of -0.07 is considerably larger in magnitude than what would have been expected based on the difference in pay of 2% that PayScale found between men and women across industries. This suggests that there may be a larger degree of discrimination in pay based on gender in the tech industry compared to other industries. Perhaps some of this effect can be explained by debunked yet pervasive stereotypes that women naturally have less ability in quantitative disciplines. This is a fascinating area for further research.
There was a large drop in coefficient magnitude for the dummy-coded race and gender variables after controlling for years of experience, years at the company, education, job title and cost of living. This drop was especially large in the coefficient for Race [Black], going from -0.17 to -0.06, a reduction of roughly 11 percentage points in the estimated pay gap. This suggests that there is important information contained in the control variables that should be explored further. Systemic differences between racial groups and genders in educational attainment, job title, years of experience and tenure at a company would all affect total annual compensation. If these mediating factors are not addressed along with outright discrimination, financial parity for demographic groups that have historically been excluded from the tech industry will be severely slowed.
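One rough way to summarise this drop is the share of each unadjusted coefficient that disappears once controls are added. The Python sketch below uses the Model 1c and Model 2c estimates from the tables above; treating the ratio as a "share absorbed" is my own simplification, not a formal mediation analysis.

```python
# Coefficients before controls (Model 1c) and after controls (Model 2c).
unadjusted = {"gender [Female]": -0.11, "Race [Asian]": 0.04, "Race [Black]": -0.17, "Race [Hispanic]": -0.07}
adjusted = {"gender [Female]": -0.07, "Race [Asian]": 0.02, "Race [Black]": -0.06, "Race [Hispanic]": -0.02}


def share_absorbed(raw, controlled):
    """Fraction of the raw coefficient that is absorbed by the controls."""
    return (raw - controlled) / raw


shares = {term: share_absorbed(unadjusted[term], adjusted[term]) for term in unadjusted}
for term, share in shares.items():
    print(f"{term}: {share:.0%} of the raw gap absorbed by controls")
```

Because the published coefficients are rounded to two decimals, these shares are only indicative, especially for the smaller coefficients.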
Ultimately, there remains a significant amount of variance in total annual compensation that cannot be explained by any of the control variables, particularly for Black tech workers and for women. As the population of these groups rises in the industry, it is increasingly important to continue to analyze the biased systems and attitudes that contribute to this phenomenon.
Though the sample for this analysis was quite large, it was still a self-selected group who chose to post on levels.fyi, and is likely not perfectly representative of the tech industry. There may be a greater prevalence of posts coming from tech workers in large tech companies rather than smaller companies or tech professions within other industries. Because the direction of the coefficients in the final model, and the magnitude of the race coefficients, agree quite closely with PayScale’s analysis, I am not worried that there is a large amount of bias in the sample.
Many of the records in the initial dataset were not included in the analysis due to missing data. It would be worth collecting more data in a few months or a year to add cases.
There is always the possibility that the remaining variance in total annual compensation between genders and races in the final model can be attributed not solely to discrimination but to some other as of yet unidentified variable that is not present in our dataset.
Additional analyses using salary, bonus and stock grant value separately as dependent variables would be interesting to compare to the current model predicting total annual compensation. I would like to scrape the levels.fyi data myself at a later date to get a larger sample with more education, race and gender data.
A deeper analysis of differences in control variables between racial and gender groups would also be interesting.
Because the control variables explained so much of the variance in pay between genders and races, I thought it would be nice to visualize some of these differences. This is not part of the formal analysis - just an extra because I wanted to know and I like ggplot.
There are a number of interesting takeaways from the chart above illustrating differences in the popularity of job titles between racial groups. Compared to other races, there is a noticeably higher percentage of Black posters who report the title “Business Analyst” - the lowest ranking job title in total yearly compensation. White posters had the highest percentage reporting “Software Engineering Manager,” the title with the highest average pay by over $100k, despite Asian and Hispanic posters reporting higher percentages of Software Engineers. This agrees with Professor Jackson Lu’s research on the “Bamboo Ceiling.” It would be interesting to assess differences in job titles between South and East Asian posters if levels.fyi ever collects more detailed data on ethnicity.
Unsurprisingly, we see that posters who identify as female in our sample are more likely to report job titles that are less technical, or more person-facing than those who identify as male.
When job titles are grouped into the top, bottom, and middle average earning titles, some additional insight can be gleaned.
Black posters are overrepresented in the bottom 5 earning job titles compared to all other racial groups. Asian posters, and to a lesser extent Hispanic posters, are underrepresented in the 5 highest earning job titles. Discrepancies in the popularity of higher vs. lower earning job titles between racial groups are certainly an interesting area to focus on within this dataset in future analyses.
Male and female identified posters see opposite patterns of popularity for the top, bottom and middle average earning job title groups. Female identifying posters are twice as likely as male posters to report a bottom five earning title - they are also more likely than male posters to report titles in the highest five earning titles. Male posters’ job titles are more concentrated in the middle earning titles. This is likely explained by male posters’ increased likelihood to work in technical roles. Of the top five highest earning titles, three are managerial positions.
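The tier grouping behind these charts can be sketched roughly as: rank titles by mean compensation, cut the ranking into bottom, middle and top groups, then compute the share of each demographic group falling in each tier. This is an illustrative Python sketch with invented records and a `tier_size` of 1 for brevity; the charts themselves use groups of five titles, and the function names are my own.

```python
from collections import defaultdict


def title_tiers(records, tier_size):
    """Map each title to 'bottom', 'middle' or 'top' by mean compensation."""
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of pay, count]
    for rec in records:
        totals[rec["title"]][0] += rec["totalyearlycompensation"]
        totals[rec["title"]][1] += 1
    # Titles ordered from lowest to highest mean compensation.
    ranked = sorted(totals, key=lambda t: totals[t][0] / totals[t][1])
    tiers = {}
    for position, title in enumerate(ranked):
        if position < tier_size:
            tiers[title] = "bottom"
        elif position >= len(ranked) - tier_size:
            tiers[title] = "top"
        else:
            tiers[title] = "middle"
    return tiers


def tier_shares(records, tiers, group_key):
    """Share of each group's records that fall in each tier."""
    counts = defaultdict(lambda: defaultdict(int))
    for rec in records:
        counts[rec[group_key]][tiers[rec["title"]]] += 1
    return {
        group: {tier: n / sum(by_tier.values()) for tier, n in by_tier.items()}
        for group, by_tier in counts.items()
    }


# Invented records for illustration only.
records = [
    {"title": "Business Analyst", "gender": "Female", "totalyearlycompensation": 110000},
    {"title": "Software Engineer", "gender": "Male", "totalyearlycompensation": 210000},
    {"title": "Software Engineer", "gender": "Female", "totalyearlycompensation": 200000},
    {"title": "Software Engineering Manager", "gender": "Male", "totalyearlycompensation": 400000},
]

tiers = title_tiers(records, tier_size=1)
shares = tier_shares(records, tiers, "gender")
print(tiers)
print(shares)
```

Passing `"Race"` instead of `"gender"` as `group_key` would give the racial breakdown, provided the records carry that field.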
The charts below compare the popularity of levels of education between racial groups and genders using the same methodology as the comparisons of job title.
It is immediately apparent that there are huge differences in educational attainment. Asian posters in our sample are about twice as likely to have a master’s degree as members of any other racial group. This is also a potentially fruitful area of research to pursue within the dataset.
Female posters are slightly more likely than male posters to have a bachelor’s or master’s degree, while male posters are more likely to have a PhD. A much higher percentage of men have education levels below college, but this accounts for a very small portion of the overall population.