Publications

INFORM Africa publications 

Managing and assembling population-scale data streams, tools and workflows to plan for future pandemics within the INFORM-Africa Consortium

Authors: Poongavanan J, Xavier J, Dunaiski M, Tegally H, Oladejo SO, Ayorinde O, Wilkinson E, Baxter C, de Oliveira T
Year of Publication: 2023

Abstract

The Data Management and Analysis Core and Next Generation Sequencing Core will facilitate and support effective data management and analysis across INFORM Africa Consortium. We will capture and provide analysis support for relevant, timely, accurate and coherent data that can be interpreted and accessed across collaborators in multiple African countries and all collaborators in this Hub and future research hub collaborators through pilot projects. Ultimately, our focus is to enable increased access to high quality data and reproducible data analysis that can be used as tools to engage policy makers in view to better prepare for future pandemics.

High-Dimensional Population Flow Time Series Forecasting Via an Interpretable Hierarchical Transformer

Authors: Songhua Hu and Chenfeng Xiong
Year of Publication: 2022

Abstract

Location-based service (LBS) data are emerging data sources in the transportation domain which contain large-scale, fine-grained, near real-time information in population flow. However, limited studies have built forecasting models based on population flow time series extracted from LBS data. This study introduces a deep learning framework, the Interpretable Hierarchical Transformer (IHTF), for high-dimensional multi-horizontal population flow time series forecasting and interpretation. A variety of cutting-edge deep learning technologies are fused, including the gated residual network to control nonlinearity and to bypass irrelevant information, the variable selection network to assign canonical variable-wise weight, the recurrent positional encoding to learn temporal locality, and the transformer architecture to capture temporal seasonality and trend. Various exogenous variables are included, endowing the framework with sensitivity in socioeconomics, demographics, land development, weather conditions, and holidays. Different internal parameters, such as variable selection weight and temporal attention weight, are extracted to explain underlying patterns learned by the framework. Numerical experiments based on one-year nationwide county-level population flow time series show that: 1) IHTF outperforms extensive baseline models in model accuracy, yielding symmetric mean absolute percentage error (SMAPE) from 8.420% (1-day-ahead) to 11.178% (21-day-ahead). 2) Model performances vary substantially across counties. Large counties broadly present better performances in relative metrics but worse performances in absolute metrics. 3) Feature relative importance generated by IHTF is similar to tree-based model but with more even distribution, among which point-of-interests (POIs) count, county location, median household income, and percentage of accommodation and food services are the most important static variables. 4) Attention weight demonstrates that IHTF can automatically learn trend and seasonality from raw time series. The framework can serve as a dynamic travel demand forecasting module in the transportation planning process. Outcomes can be fed into dynamic traffic assignment to obtain time-dependent link-level traffic conditions in future scenarios.
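
For readers unfamiliar with the accuracy metric quoted above, the following is a minimal sketch of one common SMAPE definition (mean of |forecast - actual| divided by the average of |actual| and |forecast|, expressed in percent). Several SMAPE variants exist, and the abstract does not state which one the authors used, so the formula and the function name `smape` here are assumptions for illustration only.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent.

    Assumed variant: mean of |f - a| / ((|a| + |f|) / 2), times 100.
    The paper may use a different SMAPE definition.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Treat a pair of zeros as a perfect forecast to avoid dividing by zero.
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical daily flow counts, not data from the study.
    print(round(smape([100, 200], [110, 180]), 3))
```

Because the denominator uses both the observed and the predicted value, this variant is bounded between 0% and 200%.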

Setting up data science research in Africa and engagement of stakeholders

Authors: Fati Murtala-Ibrahim; Jibreel Jumare; Manhattan Charurat; Chenfeng Xiong; Vivek Naranbhai; Patrick Dakum; Shirley Collie; Waasila Jassat; Gambo Aliyu; Adetifa Ifedayo; Alash’le Abimiku
Year of Publication: 2023

Abstract

Data science explores the use of big data to gain deeper insights and generate new knowledge and innovations which can lead to economic growth and sustainable development. However, setting up data science research comes with challenges. How we engage stakeholders is a major factor that determines success. This Commentary highlights important considerations for stakeholder engagement based on the experiences of investigators in a data science for health discovery project underway in Nigeria and South Africa. The perspectives presented will guide implementation in this relatively new but rapidly growing research domain.

Revealing human mobility trends during the SARS-CoV-2 pandemic in Nigeria via a data-driven approach

Authors: Weiyu Luo; Chenfeng Xiong; Jiajun Wan; Ziteng Feng; Olawole Ayorinde; Natalia Blanco; Man Charurat; Vivek Naranbhai; Christina Riley; Anna Winters; Fati Murtala-Ibrahim; and Alash’le Abimiku
Year of Publication: 2023

Abstract

We employed emerging smartphone-based location data and produced daily human mobility measurements using Nigeria as an application site. A data-driven analytical framework was developed for rigorously producing such measures using proven location intelligence and data-mining algorithms. Our study demonstrates the framework at the beginning of the SARS-CoV-2 pandemic and successfully quantifies human mobility patterns and trends in response to the unprecedented public health event. Another highlight of the paper is the assessment of the effectiveness of mobility-restricting policies as key lessons learned from the pandemic. We found that travel bans and federal lockdown policies failed to restrict trip-making behaviour, but had a significant impact on distance travelled. This paper contributes a first attempt to quantify daily human travel behaviour, such as trip-making behaviour and travelling distances, and how mobility-restricting policies took effect in sub-Saharan Africa during the pandemic. This study has the potential to enable a wide spectrum of quantitative studies on human mobility and health in sub-Saharan Africa using well-controlled, publicly available large data sets.

Publications with INFORM Africa acknowledgments

Alignment-Free Viral Sequence Classification at Scale

Authors: Daniel J. van Zyl, Marcel Dunaiski, Houriiyah Tegally, Cheryl Baxter, Tulio de Oliveira, Joicymara S. Xavier & The INFORM Africa research study group

Year of Publication: 2024

Abstract

The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are increasingly overwhelmed by the scale of contemporary datasets due to their high computational demands for classification. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets. We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively. Despite the high-class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.
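
As an illustration of what a word-based alignment-free representation can look like, the sketch below turns a nucleotide sequence into a fixed-length vector of k-mer frequencies. The abstract does not name the six techniques evaluated, so choosing plain k-mer counting, the function name `kmer_frequencies`, and the default `k=3` are assumptions made for this example, not the authors' pipeline.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return the relative frequency of every possible DNA k-mer.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    Windows containing other symbols (for example N) are ignored.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


if __name__ == "__main__":
    # Hypothetical toy sequence; real genomes are tens of thousands of bases.
    vec = kmer_frequencies("ACGTAC", k=2)
    print(len(vec), vec[1])  # 16 entries; index 1 is the word "AC"
```

Vectors like this have the same length for every genome regardless of sequence length, which is what allows them to be passed directly to a standard classifier such as a random forest without first aligning the sequences.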

Craft: A Machine Learning Approach to Dengue Subtyping

Authors: van Zyl DJ, Dunaiski M, Tegally H, Baxter C; INFORM Africa research study group; de Oliveira T, Xavier JS.

Year of Publication: 2025

Abstract

The dengue virus poses a major global health threat, with nearly 390 million infections annually. A recently proposed hierarchical dengue nomenclature system enhances spatial resolution by defining major and minor lineages within genotypes, aiding efforts to track viral evolution. While current subtyping tools (Genome Detective, GLUE, and NextClade) rely on computationally intensive sequence alignment and phylogenetic inference, machine learning presents a promising alternative for achieving accurate and rapid classification. We present Craft (Chaos Random Forest), a machine learning framework for dengue subtyping. We demonstrate that Craft is capable of faster classification speeds while matching or surpassing the accuracy of existing tools. Craft achieves 99.5% accuracy on a hold-out test set and processes over 140,000 sequences per minute. Notably, Craft maintains remarkably high accuracy even when classifying sequence segments as short as 700 nucleotides.

Review of open-source software for developing heterogeneous data management systems for bioinformatics applications

Authors: Danilo Silva, Monika Moir, Marcel Dunaiski, Natalia Blanco, Fati Murtala-Ibrahim, Cheryl Baxter, Tulio de Oliveira, Joicymara S Xavier, The INFORM Africa research study group

Year of Publication: 2025

Abstract

In a world where data drive effective decision-making, bioinformatics and health science researchers often encounter difficulties managing data efficiently. In these fields, data are typically diverse in format and subject. Consequently, challenges in storing, tracking, and responsibly sharing valuable data have become increasingly evident over the past decades. To address the complexities, some approaches have leveraged standard strategies, such as using non-relational databases and data warehouses. However, these approaches often fall short in providing the flexibility and scalability required for complex projects. While the data lake paradigm has emerged to offer flexibility and handle large volumes of diverse data, it lacks robust data governance and organization. The data lakehouse is a new paradigm that combines the flexibility of a data lake with the governance of a data warehouse, offering a promising solution for managing heterogeneous data in bioinformatics. However, the lakehouse model remains unexplored in bioinformatics, with limited discussion in the current literature. In this study, we review strategies and tools for developing a data lakehouse infrastructure tailored to bioinformatics research. We summarize key concepts and assess available open-source and commercial solutions for managing data in bioinformatics.

Genomic diversity and surveillance of SARS-CoV-2 in Nigeria

Authors: Thomas J Y Kono, Ezenwa J Onyemata, Natalia Blanco, Chika K Onwuamah, Nnaemeka Ndodo, Paul Oluniyi, Olanrewaju Lawal, Christina Riley, Sophia Osawe, Cheryl Baxter, Anna Winters, Chenfeng Xiong, Christian T Happi, Babatunde L Salako, Ifedayo Adetifa, Alash’le Abimiku, Manhattan Charurat, Kristen A Stafford; INFORM Africa Research Study Group

Year of Publication: 2025

Abstract

As Nigeria has the sixth-highest population in the world and a significant amount of inbound and outbound travel, the characterization of SARS-CoV-2 genomic diversity across the country is critical for understanding novel pandemic dynamics. We describe the genomic diversity of SARS-CoV-2 in Nigeria throughout the COVID-19 pandemic and examine the coverage of Nigeria’s genomic surveillance system. Genome sequences and sample metadata were downloaded from the GISAID repository. A beta regression was used to test for a relationship between fully resolved nucleotide proportion over time, as a proxy for data quality. Sample and sequencing source were compared to assess geographic coverage. A total of 7759 COVID-19 sequences collected from February 2020 to March 2023 were included. The majority were collected in 2021 (76.6%) and South West (43%). Eleven states (30%) reported 10 or fewer SARS-CoV-2 genomes across the entire period. The genome sequences submitted to GISAID from Nigeria were of high quality with very few unresolved nucleotides. Waves 4 and 5, predominantly Omicron lineages, show higher diversity around position 23 kb than the other waves. Overall, the Nigeria Centre for Disease Control (NCDC) and state-run hospitals were the largest contributors to the sample collection efforts during this study period. However, the collection efforts shifted over time from NCDC in waves 1-3 to regional hospitals and other healthcare facilities in waves 4-5, although this pattern varied by geopolitical zone (GPZ). Sequencing efforts also shifted from research laboratories during the first waves to NCDC during waves 4 and 5. The findings suggest the need for a coordinated sequencing strategy and standardized protocols to improve genomic surveillance during future outbreaks of existing and novel pathogens. 
A network of sequencing laboratories that includes at least one in each GPZ, linked to and coordinated by the national reference laboratory at NCDC might provide more balanced coverage for future pandemics and pathogen surveillance.

Variant patterns and influence of inter-regional travel during the SARS-CoV-2 expansion in South Africa

Authors: Luo W, Wu X, Li R, Fitzpatrick M, Charurat M, Blanco N, Stafford KA, Naranbhai V, Abimiku A, Winters A; INFORM Africa Research Study Group for D-SI Africa Consortium; Xiong C.

Year of Publication: 2025

Abstract

We evaluated the dynamic impacts of three types of human mobility (provincial inflows, cross-district flows, and within-district flows) on daily reported COVID-19 cases for 2020. Using a structural equation modeling approach, we conducted regressions on dynamic panel datasets. Our findings indicate that these three types of mobility influenced daily new COVID-19 case numbers in distinct and sometimes overlapping ways during the early stages of the epidemic. Within-district flows played a particularly significant role in increasing cases during the spreading stage. During the epidemic stage, we observed a sustained but gradually declining impact of within-district mobility on daily new cases, potentially highlighting the effectiveness of non-pharmaceutical interventions (NPIs). In addition, signs of social distancing fatigue were evident. Our model further shows that the first and most stringent lockdown policy significantly curtailed human mobility, whereas the second, less restrictive lockdown had negligible impact on human mobility.

The evolving SARS-CoV-2 epidemic in Africa: Insights from rapidly expanding genomic surveillance

Authors: Houriiyah Tegally, James E. San, Matthew Cotton, Monika Moir, Bryan Tegomoh, Gerald Mboowa, Darren P. Martin, Cheryl Baxter, Arnold W. Lambisia et al.
Year of Publication: 2022

Abstract

In many regions of the world, the Alpha, Beta and Gamma SARS-CoV-2 Variants of Concern (VOCs) co-circulated during 2020-21 and fueled waves of infections. During 2021, these variants were almost completely displaced by the Delta variant, causing a third wave of infections worldwide. This phenomenon of global viral lineage displacement was observed again in late 2021, when the Omicron variant disseminated globally. In this study, we use phylogenetic and phylogeographic methods to reconstruct the dispersal patterns of SARS-CoV-2 VOCs worldwide. We find that the source-sink dynamics of SARS-CoV-2 varied substantially by VOC, and identify countries that acted as global hubs of variant dissemination, while other countries became regional contributors to the export of specific variants. We demonstrate a declining role of presumed origin countries of VOCs to their global dispersal: we estimate that India contributed <15% of all global exports of Delta to other countries and South Africa <1-2% of all global Omicron exports globally. We further estimate that >80 countries had received introductions of Omicron BA.1 100 days after its inferred date of emergence, compared to just over 25 countries for the Alpha variant. This increased speed of global dissemination was associated with a rebound in air travel volume prior to Omicron emergence in addition to the higher transmissibility of Omicron relative to Alpha. Our study highlights the importance of global and regional hubs in VOC dispersal, and the speed at which highly transmissible variants disseminate through these hubs, even before their detection and characterization through genomic surveillance.

Emergence of SARS-CoV-2 Omicron lineages BA.4 and BA.5 in South Africa

Authors: Houriiyah Tegally, James E. San, Matthew Cotton, Monika Moir, Bryan Tegomoh, Gerald Mboowa, Darren P. Martin, Cheryl Baxter, Arnold W. Lambisia et al.
Year of Publication: 2022

Abstract

Three lineages (BA.1, BA.2 and BA.3) of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Omicron variant of concern predominantly drove South Africa’s fourth Coronavirus Disease 2019 (COVID-19) wave. We have now identified two new lineages, BA.4 and BA.5, responsible for a fifth wave of infections. The spike proteins of BA.4 and BA.5 are identical, and similar to BA.2 except for the addition of 69–70 deletion (present in the Alpha variant and the BA.1 lineage), L452R (present in the Delta variant), F486V and the wild-type amino acid at Q493. The two lineages differ only outside of the spike region. The 69–70 deletion in spike allows these lineages to be identified by the proxy marker of S-gene target failure, on the background of variants not possessing this feature. BA.4 and BA.5 have rapidly replaced BA.2, reaching more than 50% of sequenced cases in South Africa by the first week of April 2022. Using a multinomial logistic regression model, we estimated growth advantages for BA.4 and BA.5 of 0.08 (95% confidence interval (CI): 0.08–0.09) and 0.10 (95% CI: 0.09–0.11) per day, respectively, over BA.2 in South Africa. The continued discovery of genetically diverse Omicron lineages points to the hypothesis that a discrete reservoir, such as human chronic infections and/or animal hosts, is potentially contributing to further evolution and dispersal of the virus.

Ethics and governance challenges related to genomic data sharing in southern Africa: the case of SARS-CoV-2

Authors: Prof Keymanthri Moodley, MBChB; Nezerith Cengiz, MSc Med; Aneeka Domingo, MBChB; Gonasagrie Nair, MBChB; Adetayo Emmanuel Obasa, PhD; Richard John Lessells, MBChB; et al.
Year of Publication: 2022

Abstract

Data sharing in research is fraught with controversy. Academic success is premised on competitive advantage, with research teams protecting their research findings until publication. Research funders, by contrast, often require data sharing. Beyond traditional research and funding requirements, surveillance data have become contentious. Public health emergencies involving pathogens require intense genomic surveillance efforts and call for the rapid sharing of data on the basis of public interest. Under these circumstances, timely sharing of data becomes a matter of scientific integrity. During the COVID-19 pandemic, the transformative potential of genomic pathogen data sharing became obvious and advanced the debate on data sharing. However, when the genomic sequencing data of the omicron (B.1.1.529) variant was shared and announced by scientists in southern Africa, various challenges arose, including travel bans. The scientific, economic, and moral impact was catastrophic. Yet, travel restrictions failed to mitigate the spread of the variant already present in countries outside Africa. Public perceptions of the negative effect of data sharing are detrimental to the willingness of research participants to consent to sharing data in postpandemic research and future pandemics. Global health governance organisations have an important role in developing guidance on responsible sharing of genomic pathogen data in public health emergencies.

SARS-CoV-2 Africa dashboard for real-time COVID-19 information

Authors: Joicymara S. Xavier, Monika Moir, Houriiyah Tegally, Nikita Sitharam, Wasim Abdool Karim, James E. San, Joana Linhares, Eduan Wilkinson, David B. Ascher, Cheryl Baxter, Douglas E. V. Pires & Tulio de Oliveira
Year of Publication: 2023

Abstract

The SARS-CoV-2 Africa dashboard is an interactive tool that enables visualization of SARS-CoV-2 genomic information in African countries. The customizable app allows users to visualize the number of sequences deposited in each country, and the variants circulating over time. Our dashboard enables near real-time exploration of public data that can inform policymakers, healthcare professionals and the public about the ongoing pandemic.

Shifts in global mobility dictate the synchrony of SARS-CoV-2 epidemic waves

Authors: Houriiyah Tegally, MSc, Kamran Khan, MD, Carmen Huber, MSA, Tulio de Oliveira, PhD, Moritz U G Kraemer, DPhil
Year of Publication: 2023

Abstract

Human mobility changed in unprecedented ways during the SARS-CoV-2 pandemic. In March and April 2020, when lockdowns and large travel restrictions began in most countries, global air-travel almost entirely halted (92% decrease in commercial global air travel in the months between February and April 2020). Initial recovery in global air travel started around July 2020 and subsequently nearly tripled between May and July 2021. Here, we aim to establish a preliminary link between global mobility patterns and the synchrony of SARS-CoV-2 epidemic waves across the world.
We compare epidemic peaks and human global mobility in two time periods: November 2020 to February 2021 (when just over 70 million passengers travelled) and November 2021 to February 2022 (when more than 200 million passengers travelled). We calculate the time interval during which continental epidemic peaks occurred for both of these time periods, and we calculate the pairwise correlations of epidemic waves between all pairs of countries for the same time periods.
We find that as air travel increases at the end of 2021, epidemic peaks around the world are more synchronous with one another, both globally and regionally. Continental epidemic peaks occur globally within a 20 day interval at the end of 2021 compared with 73 days at the end of 2020, and epidemic waves globally are more correlated with one another at the end of 2021.
This suggests that the rebound in human mobility dictates the synchrony of global and regional epidemic waves. In line with theoretical work, we show that in a more connected world, epidemic dynamics are more synchronized.
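
The pairwise-correlation step described in the methods can be sketched as follows. The abstract does not specify the correlation measure or any smoothing of the case curves, so the use of the Pearson coefficient on raw series, the function names, and the toy inputs are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def mean_pairwise_correlation(curves):
    """Average Pearson correlation over all pairs of named case curves."""
    pairs = list(combinations(sorted(curves), 2))
    if not pairs:
        raise ValueError("need at least two curves")
    return sum(pearson(curves[a], curves[b]) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical case curves for three places, not data from the study.
    toy = {"A": [1, 2, 3, 2, 1], "B": [2, 4, 6, 4, 2], "C": [3, 2, 1, 2, 3]}
    print(mean_pairwise_correlation(toy))
```

Under this reading, a higher average pairwise correlation in one period than in another would indicate that epidemic waves rose and fell more nearly in step across countries.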

Global Expansion of SARS-CoV-2 Variants of Concern: Dispersal Patterns and Influence of Air Travel

Authors: Houriiyah Tegally, Eduan Wilkinson, Darren Martin, Monika Moir, Anderson Brito, Marta Giovanetti, Kamran Khan, Carmen Huber, Isaac I. Bogoch, James Emmanuel San, Joseph L.-H. Tsui, Jenicca Poongavanan, Joicymara S. Xavier, Darlan da S. Candido, Filipe Romero, Cheryl Baxter, Oliver G. Pybus, Richard Lessells, Nuno R. Faria, Moritz U.G. Kraemer, Tulio de Oliveira
Year of Publication: 2022

Abstract

In many regions of the world, the Alpha, Beta and Gamma SARS-CoV-2 Variants of Concern (VOCs) co-circulated during 2020-21 and fueled waves of infections. During 2021, these variants were almost completely displaced by the Delta variant, causing a third wave of infections worldwide. This phenomenon of global viral lineage displacement was observed again in late 2021, when the Omicron variant disseminated globally. In this study, we use phylogenetic and phylogeographic methods to reconstruct the dispersal patterns of SARS-CoV-2 VOCs worldwide. We find that the source-sink dynamics of SARS-CoV-2 varied substantially by VOC, and identify countries that acted as global hubs of variant dissemination, while other countries became regional contributors to the export of specific variants. We demonstrate a declining role of presumed origin countries of VOCs to their global dispersal: we estimate that India contributed <15% of all global exports of Delta to other countries and South Africa <1-2% of all global Omicron exports globally. We further estimate that >80 countries had received introductions of Omicron BA.1 100 days after its inferred date of emergence, compared to just over 25 countries for the Alpha variant.

©2026. INFORM Africa. All Rights Reserved