Big or small, let’s save them all: How the data was collected

For this research study, I have chosen the “Gapminder Codebook”. It compiles country-level indicators of social, economic and environmental development, providing an opportunity to study how health outcomes such as breast cancer incidence relate to behavioural and economic factors across countries.

Data source:

Gapminder data comprise global development indicators curated by the Gapminder Foundation. The foundation is a non-profit venture registered in Stockholm, Sweden, that aims to promote sustainable global development and achievement of the United Nations Millennium Development Goals through increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.

Data collection:

Since its inception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, generating a total of 215 areas. Gapminder collects data from multiple sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank.

So we know that the data for this study were collected using a longitudinal survey method. According to the FAQ on the Gapminder website, from which this dataset is taken, Gapminder uses a survey software called SurveyGizmo to create its surveys for data collection.

Measures:

After studying the Gapminder Codebook, I have decided that I am particularly interested in determining if alcohol consumption can increase the risk of breast cancer.

I therefore add the following two variables to my codebook:

a. ‘alcconsumption’ – 2008 alcohol consumption per adult (age 15+), in litres. Recorded and estimated average alcohol consumption, adult (15+) per capita, in litres of pure alcohol.

b. ‘breastcancerper100TH’ – 2002 breast cancer new cases per 100,000 females. Number of new cases of breast cancer per 100,000 female residents during the given year.

I now need to find out whether any other variables directly or indirectly contribute to breast cancer risk. On a second review of the Gapminder Codebook, I suspect that working women may be at greater risk of breast cancer. The potential confounding variable that I would like to explore in this study is therefore female employment and its association with breast cancer. The third variable will be ‘femaleemployrate’, defined as:

c. ‘femaleemployrate’ – 2007 female employees age 15+ (% of population). Percentage of the female population, aged above 15, that was employed during the given year.
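
To make the variable selection concrete, here is a minimal Python sketch of how the three measures could be pulled out of a CSV export of the codebook data. It is only an illustration under stated assumptions: the ‘country’ column name, the helper name load_measures, and every number in the sample are hypothetical and are not real Gapminder values. Only the three variable names come from the codebook.

```python
import csv
import io

# Hypothetical rows for illustration only; these are NOT real Gapminder values.
# The 'country' column name is an assumption; the other three names come from
# the codebook entries listed above.
SAMPLE_CSV = """country,alcconsumption,breastcancerper100TH,femaleemployrate
Country A,1.0,20.0,30.0
Country B,5.0,40.0,
Country C,10.0,60.0,50.0
"""

VARIABLES = ["alcconsumption", "breastcancerper100TH", "femaleemployrate"]


def load_measures(handle):
    """Return one dict per country, keeping only rows where all three
    measures parse as numbers."""
    rows = []
    for record in csv.DictReader(handle):
        try:
            values = {name: float(record[name]) for name in VARIABLES}
        except (TypeError, ValueError):
            # Blank or non-numeric cell: skip this country.
            continue
        rows.append({"country": record["country"], **values})
    return rows


measures = load_measures(io.StringIO(SAMPLE_CSV))
for row in measures:
    print(row)
```

Dropping rows with a blank cell is just one possible way to handle missing values; in the sample above, the country with no female employment rate is left out.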

Studying the data codebook, as well as some previous related work on this dataset, is very helpful either in determining a valid research question to explore or in determining the variables to study.

The outcome of interest (response variable) is the breast cancer rate, ‘breastcancerper100TH’. The explanatory variables are ‘alcconsumption’ and ‘femaleemployrate’. All these variables are quantitative and continuous, so there will be no need to manage or recode them for regression analysis.
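
As a rough illustration of the planned regression, the sketch below fits breast cancer rate on alcohol consumption and female employment rate by ordinary least squares, using only the Python standard library. The function name fit_ols is hypothetical, and the data are made-up values generated from a known linear rule, not Gapminder figures. In practice a statistics package would normally be used for the fit.

```python
def fit_ols(xs, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] obtained by
    solving the normal equations (X'X) b = X'y."""
    design = [[1.0, *row] for row in xs]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Made-up (alcconsumption, femaleemployrate) pairs, NOT Gapminder data.
predictors = [(1.0, 30.0), (5.0, 40.0), (10.0, 50.0), (2.0, 60.0)]
# Response built from a known rule so the fit can be checked by eye.
breast_cancer = [10.0 + 3.0 * alc + 0.5 * emp for alc, emp in predictors]

intercept, b_alcohol, b_employ = fit_ols(predictors, breast_cancer)
print(round(intercept, 6), round(b_alcohol, 6), round(b_employ, 6))
```

Including ‘femaleemployrate’ alongside ‘alcconsumption’ in the same model is what allows the alcohol coefficient to be read as an association adjusted for the potential confounder.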