J Syst Evol

• Research Article

PyNCBIminer: A platform for assembling phylogenetic data sets via GenBank datamining

Ruijing Cheng1, Yang Yi1, Xiaohan Wang1, Xin Liang1, Nawal Shrestha2,3, Dimitar Dimitrov4, Zhiheng Wang5, Pengshan Zhao6, and Xiaoting Xu1*   

    1Key Laboratory of Bio‐Resource and Eco‐Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
    2Department of Organismic and Evolutionary Biology, Faculty of Arts and Sciences, Harvard University Herbaria, 22 Divinity Avenue, Cambridge, MA 02138, USA
    3Department of Agriculture, School of Science, Kathmandu University, Dhulikhel, Kavre, Nepal
    4Department of Natural History, University Museum of Bergen, University of Bergen, P.O. Box 7800, 5020 Bergen, Norway
    5Institute of Ecology and Key Laboratory for Earth Surface Processes of the Ministry of Education, College of Urban and Environmental Sciences, Peking University, Beijing 100871, China
    6Key Laboratory of Ecological Safety and Sustainable Development in Arid Lands, Northwest Institute of Eco‐Environment and Resources, Chinese Academy of Sciences, Lanzhou 730000, China

    *Author for correspondence. E‐mail: xiaotingxu@scu.edu.cn
  • Received: 2024-08-04 Accepted: 2024-11-11 Online: 2025-01-03 Published: 2024-11-20
  • Supported by:
    This research was supported by the Natural Science Foundation of Sichuan Province (2023NSFSC1280), the Western Light Project of the Chinese Academy of Sciences (xbzg‐zdsys‐202204), the Scientific and Technological Innovation Project of the China Academy of Chinese Medical Sciences (CI2021A03908), and the Fundamental Research Funds for the Central Universities (SCU2024D003).

Abstract: Large phylogenies derived from publicly available genetic sequences are becoming a popular and indispensable tool for addressing core questions in ecology and evolution, as well as for tackling challenging conservation issues. Optimizing taxonomic coverage and data quality is essential for improving the precision and reliability of phylogenetic reconstructions and evolutionary inferences. Here we present PyNCBIminer, user-friendly software that automates the assembly of large DNA data sets from GenBank for phylogenetic reconstruction using the supermatrix method. PyNCBIminer uses an iterative BLAST procedure to retrieve genetic sequences accurately and efficiently from GenBank. Its retrieval strategies also improve taxon coverage and the quality of the target DNA markers. PyNCBIminer is designed to handle large data sets efficiently, but it is also suitable for medium and small data sets. It is open source and freely available on GitHub (https://github.com/Xiaoting-Xu/PyNCBIminer) and Gitee (https://gitee.com/xiaotingxu/PyNCBIminer). Its utility and performance are demonstrated through the assembly of phylogenetic data sets encompassing several genetic markers of varying sizes for the angiosperm order Dipsacales. PyNCBIminer has an advantage over similar programs in that it performs most computations on the NCBI server, eliminating the need for users to build and maintain large local databases and reducing the demands on their computers. In addition, it integrates other commonly used phylogenetic analysis software, giving users from various backgrounds convenient options for retrieving and assembling GenBank sequence data, along with flexible features that allow for user-defined parameters and strategies.

Key words: BLAST, GenBank, nucleotide sequences, phylogeny, supermatrix.
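
As an illustration of the supermatrix method named in the abstract, the following minimal Python sketch concatenates per-marker alignments into a single matrix and pads taxa that lack a marker with missing-data symbols. It is not PyNCBIminer's code: the function name `build_supermatrix`, the `missing` parameter, and the toy sequences are assumptions made for this example only.

```python
from typing import Dict

# One alignment per marker: taxon name -> aligned sequence (equal lengths).
Alignment = Dict[str, str]


def build_supermatrix(markers: Dict[str, Alignment], missing: str = "?") -> Dict[str, str]:
    """Concatenate per-marker alignments into one supermatrix.

    Hypothetical helper for illustration; taxa absent from a marker
    receive a run of `missing` characters as wide as that marker.
    """
    # Union of all taxa across markers, in a stable order.
    taxa = sorted({taxon for aln in markers.values() for taxon in aln})
    parts = {taxon: [] for taxon in taxa}

    for name, aln in markers.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"marker {name!r} is not aligned: lengths {sorted(lengths)}")
        width = lengths.pop()
        for taxon in taxa:
            # Pad with missing data when the taxon has no sequence for this marker.
            parts[taxon].append(aln.get(taxon, missing * width))

    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


if __name__ == "__main__":
    # Toy data only; these are not real GenBank sequences.
    toy = {
        "rbcL": {"Lonicera": "ATGC", "Viburnum": "ATGA"},
        "matK": {"Lonicera": "GG", "Sambucus": "GT"},
    }
    for taxon, row in build_supermatrix(toy).items():
        print(taxon, row)
```

In this toy case Sambucus has no rbcL sequence, so its row begins with four `?` characters. Padding rather than dropping such taxa is what lets a supermatrix keep broad taxon coverage even when markers are sampled unevenly.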