Genes That Shape Bones Identified, Offering Clues About Our Past and Future

Frontera supports analysis of deep learning models for genetic map of skeletal proportions

The genes that shape the human skeleton have been identified, in research that used TACC’s Frontera supercomputer to run analysis of deep learning models. Credit: UT Austin News.

Adapted from a news release by Esther Robards-Forbes, College of Natural Sciences, UT Austin

Using artificial intelligence (AI) to analyze tens of thousands of X-ray images and genetic sequences, researchers from The University of Texas at Austin and New York Genome Center have been able to pinpoint the genes that shape our skeletons, from the width of our shoulders to the length of our legs.

The research, published July 2023 as the cover article in Science, pulls back a curtain on our evolutionary past and opens a window into a future where doctors can better predict patients’ risks of developing conditions such as back pain or arthritis in later life.

“Our research is a powerful demonstration of the impact of AI in medicine, particularly when it comes to analyzing and quantifying imaging data, as well as integrating this information with health records and genetics rapidly and at large scale,” said Vagheesh Narasimhan, an assistant professor of integrative biology as well as statistics and data science, who led the multidisciplinary team of researchers to provide the genetic map of skeletal proportions.

Humans are the only large primates to have longer legs than arms, a change in skeletal form that is critical to walking on two legs. The scientists sought to determine which genetic changes underlie anatomical differences that are clearly visible in the fossil record leading to modern humans, from Australopithecus to Neanderthals. They also wanted to find out how the skeletal proportions that allow bipedalism affect the risk of musculoskeletal diseases such as arthritis of the knee and hip, conditions that affect billions of people worldwide and are the leading causes of adult disability in the United States.

The researchers used deep learning models to perform automatic quantification on 39,000 medical images to measure distances between shoulders, knees, ankles and other points in the body. By comparing these measurements to each person’s genetic sequence, they found 145 points in the genome that control skeletal proportions.
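As a rough illustration of the measurement step, the sketch below turns landmark coordinates into skeletal lengths. The landmark names, pixel coordinates and image scale are all made-up assumptions for this example, not values from the study, and the study's own pipeline may differ.

```python
import math

# Hypothetical landmark positions in pixels (x, y), of the kind a
# pose-estimation model might output for one whole-body image.
landmarks = {
    "left_shoulder": (110.0, 80.0),
    "right_shoulder": (190.0, 80.0),
    "left_hip": (130.0, 240.0),
    "left_knee": (130.0, 330.0),
    "left_ankle": (130.0, 410.0),
}

# Assumed image scale; a real value would come from the scanner metadata.
CM_PER_PIXEL = 0.5


def distance_cm(a, b, points=landmarks, scale=CM_PER_PIXEL):
    """Euclidean distance between two named landmarks, in centimeters."""
    return math.dist(points[a], points[b]) * scale


shoulder_width = distance_cm("left_shoulder", "right_shoulder")
femur_length = distance_cm("left_hip", "left_knee")
tibia_length = distance_cm("left_knee", "left_ankle")

print(f"shoulder width: {shoulder_width:.1f} cm")
print(f"femur length:   {femur_length:.1f} cm")
print(f"tibia length:   {tibia_length:.1f} cm")
```

Once every image has been reduced to a handful of lengths like these, each length can be treated as a trait and compared against a person's genetic sequence.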

The computer vision model called HRNet performed the image classification. The researchers used two rounds of transfer learning, in which knowledge learned from completing one task is reincorporated to boost model performance. The scientists validated their approach on images of the same person taken more than two years apart and found that the correlation between the two sets of measurements was greater than 99 percent.
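A repeat-measurement check of this kind can be sketched as a correlation between two visits. The example below computes a Pearson correlation by hand; the femur lengths are invented for illustration, and the article does not say which correlation measure the team used, so treat the choice of Pearson's r as an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up femur lengths (cm) for the same five people at two imaging visits.
first_visit = [44.8, 41.2, 47.5, 43.0, 45.9]
second_visit = [44.9, 41.0, 47.6, 43.2, 45.8]

r = pearson_r(first_visit, second_visit)
print(f"test-retest correlation: {r:.4f}")
```

A value close to 1 means the automated measurements of the same skeleton barely change between scans, which is what one would expect for adult bone lengths.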

All analysis was performed on the UK Biobank dataset, which has paired imaging, genetic and lifetime electronic health care records on 30,000 anonymized individuals. Researchers used deep learning models both for quality control and for taking measurements from the imaging data. To connect these measurements to the genetic data, they carried out genome-wide association studies that employ a linear mixed model to control for population structure among individuals in the dataset.

Scientists Rely on Supercomputing 

The main computational challenges were on the imaging side of the study, where the science team needed access to high-performance GPUs for model training. Another major challenge was the genetic data, which is extremely large, covering millions of positions in the genome across tens of thousands of individuals.

Narasimhan was awarded an allocation on TACC’s Frontera, the fastest academic supercomputer in the U.S., which he used to meet these computational challenges.

“TACC has both offerings for large memory GPUs as well as a large number of high-performance CPUs,” Narasimhan said. “Genetic data is naively parallelized, meaning that TACC is ideally suited to allow us to run our association analysis independently on different parts of the data and stitch them back together at the end. We used tens of thousands of hours of compute to perform these calculations, and it would have been impossible without access to this resource.”
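The "run independently, then stitch together" pattern Narasimhan describes can be sketched in a few lines. This is a toy under stated assumptions: the variant names and data are invented, a plain least-squares slope stands in for the study's linear mixed model, and a thread pool stands in for the separate jobs a supercomputer would run on different nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def slope(genotypes, trait):
    """Least-squares slope of a trait on genotype dosage (0, 1 or 2)."""
    n = len(trait)
    mean_g = sum(genotypes) / n
    mean_t = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        return 0.0  # everyone has the same genotype: nothing to estimate
    sxy = sum((g - mean_g) * (t - mean_t) for g, t in zip(genotypes, trait))
    return sxy / sxx


def analyze_chunk(chunk, trait):
    """Test every variant in one chunk; no chunk depends on any other."""
    return {variant_id: slope(g, trait) for variant_id, g in chunk.items()}


def run_in_chunks(variants, trait, chunk_size=2, workers=4):
    """Split the variants into chunks, analyze them concurrently, then merge."""
    ids = list(variants)
    chunks = [
        {v: variants[v] for v in ids[i:i + chunk_size]}
        for i in range(0, len(ids), chunk_size)
    ]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(analyze_chunk, chunks, [trait] * len(chunks)):
            results.update(partial)  # stitch the pieces back together
    return results


# Toy data: one trait value per person, one genotype list per variant.
trait = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0]
variants = {
    "var1": [0, 1, 2, 1, 0, 2],
    "var2": [1, 1, 1, 1, 1, 1],
    "var3": [2, 1, 0, 1, 2, 0],
    "var4": [0, 0, 1, 1, 2, 2],
    "var5": [0, 2, 0, 2, 0, 2],
}

effects = run_in_chunks(variants, trait)
for variant_id, beta in effects.items():
    print(variant_id, round(beta, 3))
```

Because each variant is tested on its own, the chunked result matches what a single serial pass would produce; the only thing that changes is how many pieces can run at the same time.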

TACC’s Frontera is a strategic national capability computing system funded by the National Science Foundation.

“Our work provides a road map connecting specific genes with skeletal lengths of different parts of the body, allowing developmental biologists to investigate these in a systematic way,” said Tarjinder (T.J.) Singh, a co-author of the study, associate member at the New York Genome Center and assistant professor in the Columbia University Department of Psychiatry.

The team also examined how skeletal proportions associate with major musculoskeletal diseases and showed that individuals with a higher ratio of hip width to height were more likely to develop osteoarthritis and pain in their hips. Similarly, people with a higher ratio of femur (thigh bone) length to height were more likely to develop arthritis in their knees, knee pain and other knee problems. People with a higher ratio of torso length to height were more likely to develop back pain.

“These disorders develop from biomechanical stresses on the joints over a lifetime,” said Eucharist Kun, a UT Austin biochemistry graduate student and lead author on the paper. “Skeletal proportions affect everything from our gait to how we sit, and it makes sense that they are risk factors in these disorders.”

The results of their work also have implications for our understanding of evolution. The researchers noted that several genetic segments that control skeletal proportions overlapped more than expected with areas of the genome called human accelerated regions. These are sections of the genome that are shared by great apes and many vertebrates but have diverged significantly in humans. This provides a genomic rationale for the divergence in our skeletal anatomy.

Said Narasimhan: “In the past, research conducted in human evolution used extremely low sample sizes and involved manual measurement. Thus, the field was largely restricted to qualitative analysis. The integration of genetic information has created a revolution in the field allowing for precise statistical estimates to be produced from the ever-expanding amount of data available. Genetic data is inherently large in size and methods to analyze them have gone hand in hand with technological development to obtain such information. These methods have always been employed on supercomputing environments which allow for distributed computing and efficient I/O.”

In addition to Kun and Narasimhan, the co-authors are Tarjinder Singh of the New York Genome Center and Columbia University; Emily M. Javan, Olivia Smith, Javier de la Fuente, Brianna I. Flynn, Kushal Vajrala, Zoe Trutner, Prakash Jayakumar and Elliot M. Tucker-Drob of UT Austin; Faris Gulamali of Icahn School of Medicine at Mount Sinai; and Mashaal Sohail of Universidad Nacional Autonoma de Mexico. The research was funded by the Allen Institute, Good Systems at UT Austin and the National Institutes of Health, with graduate student fellowship support provided by the National Science Foundation and UT Austin’s provost’s office.