Scientific software is not the same as traditional, commercial software. It differs in how it is developed, who develops it, and the purpose it serves. However, software engineering knowledge applied to data science is still seldom studied.
In this project, you'll read and classify (first manually, then automatically) GitHub issues on large R/Python packages (exclusively data science packages) to determine the topics and problems developers are facing. GitHub is the main hub for open-source collaboration, and R and Python are package-based environments, meaning that other software depends on packages. Having issues addressed quickly (and questions answered) is therefore essential to ensure the quality of results derived by using or importing a package. You will perform manual classification, topic modelling, sentiment analysis, and analysis of static data (such as trends, the life-span of issues, and the types of collaborators involved).
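As an illustration of the static-data analysis mentioned above, the following is a minimal sketch of how the life-span of issues could be computed once the issues have been downloaded. It assumes each issue is a dictionary carrying the `created_at`, `closed_at`, and (for pull requests only) `pull_request` fields found in GitHub REST API issue objects. The function names and the sample records are hypothetical and exist only for this example; they are not project data.

```python
from datetime import datetime
from statistics import median


def parse_ts(ts):
    """Parse a GitHub-style UTC timestamp such as '2021-03-01T12:00:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def issue_lifespans_days(issues):
    """Return the life-span in days of every closed issue, skipping pull requests."""
    lifespans = []
    for issue in issues:
        # The issues endpoint also returns pull requests; they carry a
        # 'pull_request' key, so they are excluded here.
        if "pull_request" in issue:
            continue
        # Issues that are still open have no closing time yet.
        if issue.get("closed_at") is None:
            continue
        delta = parse_ts(issue["closed_at"]) - parse_ts(issue["created_at"])
        lifespans.append(delta.total_seconds() / 86400)
    return lifespans


# Invented sample records (assumption: not real project data).
sample = [
    {"number": 1, "created_at": "2021-03-01T12:00:00Z", "closed_at": "2021-03-03T12:00:00Z"},
    {"number": 2, "created_at": "2021-03-02T00:00:00Z", "closed_at": None},
    {"number": 3, "created_at": "2021-03-05T00:00:00Z", "closed_at": "2021-03-05T12:00:00Z",
     "pull_request": {}},
    {"number": 4, "created_at": "2021-03-10T06:00:00Z", "closed_at": "2021-03-14T06:00:00Z"},
]

lifespans = issue_lifespans_days(sample)
print(lifespans)          # life-spans of issues 1 and 4, in days
print(median(lifespans))  # median life-span in days
```

Open issues are dropped here for simplicity; an actual study might instead treat them as still-running observations, which is a design choice left to the analysis.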
You will need to be systematic and eager to uncover the problems data scientists are facing. The work is mostly manual, but the dataset and taxonomy it generates will be essential for future work.
- Programming knowledge, preferably either Python or R. Other languages are welcome but not needed.
- Knowledge of (or willingness to quickly learn) how to use APIs to download data.
- Demonstrated academic writing skills.
- Excellent attention to detail.
Z. Codabux, M. Vidoni, F. Fard, "Technical Debt in the Peer-Review Documentation of R Packages: a rOpenSci Case Study," in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 195-206. https://doi.ieeecomputersociety.org/10.1109/MSR525...