Lee Voth-Gaeddert and I have been working on methods to explore and analyze health data to tackle stunting in Guatemala. Our technical paper was titled “Improving Health Information Systems in Guatemala Using Weighted Correlation Network Analysis”. This paper is an early-stage effort to look at weighted correlation network analysis as a potential tool.
Download my presentation or view the transcribed version below.
My name is Devin Cornell and I’m a PhD student in Sociology at UC Santa Barbara, where I study group decision-making and team performance. Not here today is my colleague Lee Voth-Gaeddert. He’s currently in Guatemala working as a volunteer with the Peace Corps.
Our topic here is a tool that can help professionals interpret and build narratives of understanding from large amounts of data.
We take advantage of popular tools for big data analysis and visualization to bring a new experience to collaborators. Our hope is that the approach can be simple enough to be used by all stakeholders but flexible enough to explore problems from many different perspectives.
The presentation follows the four main aspects of a research presentation, covering both our analysis and results using the Guatemala data and the tool in general.
We are motivated by child stunting in Guatemala. Stunting is defined as a height-for-age more than two standard deviations below the median of the WHO child growth standards. Although there may be many possible causes, it is generally agreed that stunting results from a lack of access to, or poor absorption of, the nutrients needed for bodily development.
23% of children are stunted globally and in Guatemala that number increases to 49%. This problem could result in physical or mental development issues and is addressed by Sustainable Development Goal 2.2: “By 2030, end all forms of malnutrition, including … stunting and wasting in children under 5 years of age…”
This diagram, from a 2013 WHO report, outlines the major groups of causes and effects of stunting in the population. Stunting has been described as one of the most wicked problems faced in development today. The biggest direct factors are household and family structure, inadequate feeding, breastfeeding, and infection. Remember this diagram so we can compare it with the results of our analysis later.
Now let’s look at health information systems and the role they could play in looking at this problem. According to a 2008 report by the WHO on Health Information Systems, there are four purposes to these systems. While most of the current HIS work in Guatemala focuses on the first two, we are working on the analysis and communication portions. If we assume we can get access to compiled data, how can we perform analysis?
In theory, the classical social science approach is to use data and assumptions about the world to build a model that leads to understanding of the real world. While our approach focuses much more on the data in order to be able to change assumptions, we are primarily interested in the feedback loop between our derived understanding back into the model and the assumptions.
We want to be able to quickly examine different possible assumptions made about the underlying system to get a better mental picture of what is actually going on.
Our approach has three aspects: tools, a dataset, and analysis algorithms.
Our selection of tools is important because we are taking advantage of modern tools for data analysis that can also scale to production web or desktop applications. We also started using the Gephi network visualization software to explore the data. Our current Python program takes in data and outputs a “.gexf” file that can be visualized in Gephi, but eventually we’d like the visualization to happen alongside the Python code. We have started working on this, but existing solutions still seem to be lacking, so it may take some time. These two aspects were important in judging the timeliness of this project.
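As a rough illustration of the “.gexf” hand-off to Gephi, here is a minimal sketch that writes a weighted, undirected network using only the Python standard library. It is not the program described above: the function name write_gexf and the edge dictionary layout are assumptions made for this example, and only the basic GEXF 1.2 elements (nodes, edges, weight) are written.

```python
import xml.etree.ElementTree as ET

def write_gexf(edges, path):
    """Write {(u, v): weight} as a minimal undirected GEXF 1.2 file.

    The dictionary layout is a hypothetical choice for this sketch.
    """
    root = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(root, "graph", mode="static", defaultedgetype="undirected")
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    # Every variable name that appears in an edge becomes a node.
    names = sorted({name for pair in edges for name in pair})
    for name in names:
        ET.SubElement(nodes_el, "node", id=name, label=name)

    # One <edge> element per variable pair, carrying its weight.
    for i, ((u, v), w) in enumerate(sorted(edges.items())):
        ET.SubElement(edges_el, "edge", id=str(i), source=u, target=v, weight=str(w))

    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

A call such as write_gexf({("stunting", "diet_diversity"): 2.8}, "network.gexf") would produce a file that can be opened in Gephi; the variable names and weight here are made up.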
Our dataset was compiled and encoded from the results published with the Title II Development Food Assistance Program contracted by the USAID Office of Food for Peace. They surveyed mostly agrarian families about health behaviors like child health, household descriptions, maternal health, sanitation, and breastfeeding.
The general idea of our network analysis approach is that we start with a number of measured variables corresponding to encoded responses to specific survey questions about these health behaviors. Next, all the information about these variables is reduced to simply the relationships between them. This can be captured in any number of ways, but our approach was to use simple correlation. Finally, we perform some transformation on the relationship network that reduces the model to include only the useful information. This is essentially the network implementation of the data reduction approach that I described earlier in the presentation. This compression leads us to draw conclusions and extract meaning from our data, but it also has the effect of leaving out part of the story.
So again, the relationship we used to prove out the approach was a simple correlation that ignores samples with missing data. Then we performed a transformation on the edge weights that maps large correlations to have small weights, and then defines path distances as the sum of these transformed edge weights.
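The two steps just described can be sketched roughly as follows. This is only an illustration under stated assumptions: the exact transformation is given in the equations slide, so the form |r| ** -Beta used here is a stand-in that merely has the stated property (large correlations map to small weights), and the function names and the dictionary-of-columns layout are hypothetical. Missing answers are represented as None.

```python
import math

def pairwise_corr(x, y):
    """Pearson correlation using only samples where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def edge_weight(r, beta):
    """Assumed transform: strong correlations become short edges."""
    if r == 0:
        return math.inf  # uncorrelated variables are infinitely far apart
    return abs(r) ** -beta

def build_network(columns, beta):
    """Return {(u, v): weight} for every pair of variables in `columns`."""
    names = sorted(columns)
    edges = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            r = pairwise_corr(columns[u], columns[v])
            edges[(u, v)] = edge_weight(r, beta)
    return edges
```

A path distance is then simply the sum of these edge weights along the path, so a larger Beta penalizes weak correlations more heavily than strong ones.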
To illustrate this, consider the following graph with v1, v2, and v3. Assume that the outcome is v4 and v1 is a cause. The question is whether v1 affects v4 directly or whether v1 causes v2, which in turn causes v4. Correlation alone can’t answer this question, but our knowledge about the underlying system can. Since v1 has a reasonably significant correlation with v4, that causal pathway is significant, and a particular Beta parameter would remove the two correlations of 0.6, leaving only v1->v4 (and perhaps also v2->v4). An increased Beta value (specifically above ~2.1; see the supplemental slide at the end) would result in the use of the indirect pathway v1->v2->v4.
The point is that either of these perspectives could be valid, but we don’t really know that without careful examination of both possibilities. Thus, using only the raw data and a chosen value of Beta, we can examine both perspectives.
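The switch between the two pathways can be reproduced with a small shortest-path sketch. Two things here are assumptions rather than values from the slides: the edge weight form |r| ** -Beta (a stand-in with the right qualitative behavior) and the direct v1-v4 correlation of 0.43, which was chosen only so that the switch lands near the ~2.1 mentioned above. The two correlations of 0.6 are from the example.

```python
import heapq

def edge_weight(r, beta):
    """Assumed transform: strong correlations become short edges."""
    return abs(r) ** -beta

def shortest_path(corrs, source, target, beta):
    """Dijkstra over the correlation network; returns the list of nodes."""
    graph = {}
    for (u, v), r in corrs.items():
        w = edge_weight(r, beta)
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[node]:
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))

    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# 0.6 values are from the example; 0.43 is a hypothetical direct correlation.
CORRS = {("v1", "v2"): 0.6, ("v2", "v4"): 0.6, ("v1", "v4"): 0.43}

for beta in (1.0, 2.0, 3.0):
    print(beta, shortest_path(CORRS, "v1", "v4", beta))
```

With these assumed numbers the direct edge is the shortest path at Beta = 1 and Beta = 2, while at Beta = 3 the route through v2 becomes shorter.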
For our analysis we plugged in our USAID data with the arbitrary selection of Beta = 2. If you examine the overall pattern of the figure, it closely resembles the World Health Organization’s diagram. That figure was agreed upon by experts familiar with the problem along with many studies. Our perspective was generated by simply plugging in the data with the particular Beta = 2 and visualizing.
So we get the overall character of the data, but what else can we learn?
The next step we explored was the use of quantile analysis. This involves breaking up the data according to a specific variable (we used stunting then age), and creating a separate causal tree for each one. Let’s look at the stunting quantiles first.
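The splitting step can be sketched as below. This is a simplified illustration, not our actual code: the record layout (one dictionary per surveyed child), the field name, and the function name are assumptions, and records missing the splitting variable are simply dropped.

```python
def split_into_quantiles(records, key, n_groups):
    """Sort records by `key` and cut them into n_groups nearly equal parts."""
    present = [r for r in records if r.get(key) is not None]
    ordered = sorted(present, key=lambda r: r[key])
    n = len(ordered)
    # Boundaries of each group as positions in the sorted list.
    bounds = [round(i * n / n_groups) for i in range(n_groups + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(n_groups)]

# Hypothetical records: only the splitting variable is shown.
children = [{"age_months": m} for m in (3, 40, 12, 27, 8, 55, 19, 33)]
groups = split_into_quantiles(children, "age_months", 4)
```

Each group is then run through the same network construction with the same Beta, so any difference between the resulting trees comes from the data in that group rather than from the parameter.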
The first causal tree, created from children who are classified as ‘not stunted’, appears at the top. There are five first-order factors shown in that one. Now compare it to the stunted (bottom left) and extremely stunted (bottom right) causal trees. Although factors like corn storage and diet diversity are still important, they are relatively less important when compared with the ones that do appear. With the same Beta parameter, this structural comparison can quickly convey significant differences in causal structure.
We can draw a couple of conclusions from the results: food and diet are less important for stunted children. This could mean that factors like breastfeeding, etc., are necessary but not sufficient for the normal development of Guatemalan children. Again, this is only one possible way of looking at our data. Now let’s examine differences across age quantiles.
Now let’s look at four age quantiles. Conclusions here are a little less obvious.
If you do look closely, there are a few things we can tell, though. First, animals and fecal matter exposure matter more for young children. Could this be because free-range animal exposure is affecting young children’s height? It is a possibility. We also find differences in diet diversity and corn storage effects at younger ages.
More general conclusions: we can quickly explore many possible perspectives by adjusting Beta. This could be a useful method of analysis if placed into the proper end application.
In order to explore stunting in Guatemala further, we are currently working on collecting more data. We will need to capture more information about mother and child behavior to determine how they relate to stunting.
Furthermore, this presentation demonstrates that the approach could be useful in some kind of collaborative application. We plan to build a prototype application that uses this analysis in an interactive manner.
And these are my references. Any questions?
For reference I also attached a few supplemental slides. The first one looks at the simple three-factor example I used previously in the presentation.
The figure above shows the competing path lengths as the parameter Beta is varied. When Beta passes the intersection point around 2.1, the direct causal pathway v1->v4 will disappear in favor of v1->v2->v4. This is an example of how the parameter can reveal different perspectives that can all be justified by the data. Causation versus correlation is a problem first presented with the advent of science: there’s no clear-cut way to determine causality without subject matter expertise, so this tool helps experts think about problems from different perspectives.
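For readers who want to see where such an intersection point comes from, here is a small worked calculation. It assumes an edge weight of the form |r| ** -Beta, which is a stand-in and not necessarily the transformation in the equations slide, and it assumes both indirect edges share the same correlation. Setting the direct length equal to the indirect length, r_direct ** -Beta = 2 * r_indirect ** -Beta, gives Beta = ln(2) / ln(r_indirect / r_direct). The direct correlation of 0.43 is a hypothetical value for illustration.

```python
import math

def crossover_beta(r_direct, r_indirect):
    """Beta at which the direct edge and the two-step path have equal length.

    Assumes weights |r| ** -beta and two indirect edges with equal correlation.
    """
    return math.log(2) / math.log(r_indirect / r_direct)

# 0.6 is from the three-factor example; 0.43 is a hypothetical direct correlation.
print(round(crossover_beta(0.43, 0.6), 2))
```

Below this value of Beta the direct pathway is shorter, and above it the indirect pathway wins, which is the behavior the figure illustrates.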
For completeness I also wanted to include some of the survey questions associated with the figures in the presentation.
And finally, this is a complete list of equations used to construct the network used for analysis.