This is the final post in a short series of blogs on package management.
The first post explained the role of repositories and libraries.
The second post explored how package management pain shows up in different organizations.
As a Solutions Engineer at RStudio, I spend a lot of time helping data science teams figure out their package management needs.
I often meet with IT/Admins frustrated with trying to provide data scientists with the packages they need while also maintaining stability and security. I also speak with data scientists discouraged and annoyed at how hard it is to get the open source R and Python packages they need.
The resulting cat-and-mouse games often end in creative detentes – idiosyncratic package management strategies that kinda work for everyone involved. On the other hand, organizations with secure, low-friction package management strategies seem to follow just a few patterns.
“Happy package management processes are all alike; every unhappy package management process is unhappy in its own way.”
-Leo Tolstoy, Anna Karenina (sorta)
In this post, I’ll share the common components I see at organizations where IT/Admins and data scientists both contribute to a package environment that is secure, reproducible, and easy to use.
Divvying Up Responsibility
While the details of package management differ widely from one organization to another, organizations with secure, low-friction package management processes usually exhibit a three-part framework, with clear ownership of each part.
One way this pattern can go awry is that admins, trying to be helpful, decide to take control of the package libraries themselves. We previously explored why admins controlling repositories and data scientists controlling libraries tends to be a much lower-friction way to manage package environments.
Part I: Add packages to repositories
In most organizations with good package management processes, admins decide whether a private package repository is needed and, if so, what packages are in the organization’s shared package repositories.[1] Many organizations decide that a public package repository like CRAN, PyPI, or public RStudio Package Manager is sufficient.
In other organizations, there may not be open access to the internet, packages might need to be validated before they can be used, or there might be heavy usage of internally-developed packages. In these cases, organizations configure an internal CRAN or PyPI mirror. RStudio Package Manager is RStudio’s professional product for this purpose.
Data scientists and admins trying to choose the right configuration for their organization might want to consider the pain points explored in the previous post in this series as well as the decision tree on the RStudio solutions site.
Part II: Set defaults so things “just work”
Once security concerns are satisfied, admins spend a lot of time making sure that data scientists can get to work as soon as they enter their data science environment. Admins want to ensure data scientists have all the packages they need.
It often works well for admins to set default settings for users so package installs just work. Admins generally set appropriate default repositories and install required system libraries. Some admins additionally choose to install a “starter” package set for all users.
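As a rough illustration of the "default repositories" step on the Python side, the sketch below renders a pip configuration file that points installs at an internal mirror by default. The mirror URL and the function name are hypothetical placeholders, not anything from RStudio Package Manager; the only real convention assumed here is pip's `[global]` section with an `index-url` key. The R analogue would be setting the `repos` option in a site-wide R profile.

```python
import configparser
import io

# Hypothetical internal mirror URL (an assumption for illustration only);
# replace it with your organization's repository address.
INTERNAL_INDEX_URL = "https://packages.example.com/pypi/simple"


def render_pip_conf(index_url: str) -> str:
    """Return the text of a pip config file that makes index_url the default index."""
    # interpolation=None keeps any '%' characters in the URL literal.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # An admin would typically save this text to a system-wide location
    # (for example /etc/pip.conf on Linux) so every user inherits it.
    print(render_pip_conf(INTERNAL_INDEX_URL))
```

With a file like this in place, a plain `pip install` resolves against the internal repository without each data scientist having to pass extra flags.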
More details on how to do all those things are on the RStudio Solutions site.
Many organizations choose to centralize all of their data scientists on RStudio Server Pro to simplify the administration.
Part III: Use and capture reproducible project environments
The last step of the process is data scientists doing their work! If admins have successfully configured a repository and package defaults, this should be an extremely low-friction process for data scientists, even if they’re inside an air-gapped or validated environment.
In the best case, data scientists use project-level isolation of packages using tools like renv and virtualenv to ensure project package libraries are isolated, reproducible, and shareable.
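To make "capturing" an environment concrete, here is a minimal sketch that records the exact package versions visible to the current Python interpreter, in the spirit of what `pip freeze` does inside a virtualenv or what `renv::snapshot()` does for an R project. It uses only the standard library; the function names and the `requirements.lock` file name are illustrative assumptions, not part of any tool mentioned in this post.

```python
import tempfile
from importlib import metadata
from pathlib import Path


def snapshot_installed_packages() -> list[str]:
    """Return sorted 'name==version' pins for every distribution this interpreter can see."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def write_lockfile(path: Path) -> int:
    """Write the pins to path, one per line, and return how many were written."""
    pins = snapshot_installed_packages()
    path.write_text("\n".join(pins) + "\n", encoding="utf-8")
    return len(pins)


if __name__ == "__main__":
    # Written to a temporary directory here; in a real project the file
    # would live next to the code and be committed alongside it.
    with tempfile.TemporaryDirectory() as tmp:
        lockfile = Path(tmp) / "requirements.lock"  # hypothetical file name
        count = write_lockfile(lockfile)
        print(f"captured {count} pinned packages")
```

Run from inside a project's virtual environment, the resulting file lists only that project's packages, which is what makes the environment shareable: a colleague can install the same pins into a fresh environment.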
Great Process Leads to Great Outcomes
A three-part package management plan allows admins to be confident that their network is secure and that data scientists aren’t blocked trying to acquire the packages they need. Data scientists are also able to access and use the packages they need to do their work.
Within the three-part structure, organizations’ package needs are as varied as the organizations themselves, and an earlier blog post explored why teams make different choices within this framework.
If you think your organization could benefit from more information on package management, contact our sales team to learn more about how RStudio Package Manager and RStudio Server Pro work together to make it easy for admins to create a safe, low-friction environment where data scientists can be productive.
For more on this topic, please see the recording of our free webinar on Managing Packages for Open-Source Data Science.
[1] Sometimes this package admin is a member of the IT or DevOps organization, and sometimes they’re a data scientist.