Database Setup

The previous page explained how users can examine and execute individual calculations one at a time to understand how the methods work and how to use them. When calculations are run that way, however, collecting and managing the generated results is left to the user. Settings in iprPy allow one or more databases to be defined so that any generated results are collected automatically and stored together in a common format. Collecting results into a database is also a major benefit when performing high throughput calculations, as the vast number of simulations performed almost necessitates some infrastructure to query subsets of the results and apply common analysis methods across similar records.

1. Basic Settings

Many default settings of how iprPy behaves can be adjusted using the methods of iprPy.settings. Updates to settings are stored in a local file so that future sessions adhere to the changes that have been made. Additionally, the same settings file is used by potentials, atomman, and iprPy, so changes made through one package are reflected in the others wherever they support the same terms.

The basic settings terms that can be changed are:

  • directory is the root directory where the settings file is saved. The default value is a folder in the user’s home directory. Most users will likely not want to change this, but the option exists for edge cases.

  • runner_log_directory is the directory where any logs generated by runners are saved. These logs can be useful for debugging as they state the status of all calculations that were performed by each runner. The default value is a folder named “runner-logs” inside the above directory.
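The set/unset behavior described above can be pictured with a small sketch. This is not iprPy's implementation: the SketchSettings class, the settings.json filename, and the file layout are all assumptions made only for illustration. It shows an override being saved to a file, picked up by a later session, and then reverted to the default.

```python
import json
import tempfile
from pathlib import Path


class SketchSettings:
    """Hypothetical stand-in for a file-backed settings object (not iprPy's class)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        # Assumed filename, chosen only for this sketch
        self.filename = self.directory / "settings.json"

    def _load(self):
        # Return the saved overrides, or an empty dict if nothing was saved yet
        if self.filename.is_file():
            return json.loads(self.filename.read_text())
        return {}

    def _save(self, content):
        self.filename.write_text(json.dumps(content, indent=2))

    @property
    def runner_log_directory(self):
        # Use the saved override if present, otherwise the default location
        saved = self._load().get("runner_log_directory")
        if saved is None:
            return self.directory / "runner-logs"
        return Path(saved)

    def set_runner_log_directory(self, path):
        content = self._load()
        content["runner_log_directory"] = str(path)
        self._save(content)

    def unset_runner_log_directory(self):
        content = self._load()
        content.pop("runner_log_directory", None)
        self._save(content)


# A temporary folder stands in for the settings directory
root = Path(tempfile.mkdtemp())
settings = SketchSettings(root)
default_logs = settings.runner_log_directory

settings.set_runner_log_directory(root / "my-logs")
# A new object reading the same file sees the change, as a later session would
later_session = SketchSettings(root)
changed_logs = later_session.runner_log_directory

# Unsetting removes the override so the default applies again
later_session.unset_runner_log_directory()
reverted_logs = SketchSettings(root).runner_log_directory
```

Storing only the overrides, rather than every value, is one simple way to make "unset" mean "fall back to the default".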

1.1. Command line

$ iprPy directory

will print the current directory that contains the settings file.

$ iprPy set_directory <path>

will change the settings directory to the specified path.

$ iprPy unset_directory

will revert the settings directory back to its default path in the user’s home directory.

$ iprPy runner_log_directory

will print the current directory where runner log files will be saved.

$ iprPy set_runner_log_directory <path>

will change the runner log directory to the specified path.

$ iprPy unset_runner_log_directory

will revert the runner log directory back to “runner-logs” inside the settings directory.

1.2. Python

All settings options can be accessed using iprPy.settings. The attribute and method names are consistent with the command line options:

import iprPy

# Check the current settings
print(iprPy.settings.directory)
print(iprPy.settings.runner_log_directory)

# Change the directory paths
# (NEWPATH1 and NEWPATH2 are placeholders for paths of your choosing)
iprPy.settings.set_directory(NEWPATH1)
iprPy.settings.set_runner_log_directory(NEWPATH2)

# Revert the directory paths back to their default values
iprPy.settings.unset_directory()
iprPy.settings.unset_runner_log_directory()

2. Define databases

Being able to interact with multiple databases and types of databases is a feature of iprPy that has many benefits:

  • The design means that the high throughput operations are not explicitly tied to a single database or database infrastructure.

  • Multiple databases make it possible to separate calculation runs based on specific studies. This allows for targeted investigations or parameter studies to be performed without polluting a “main” database.

  • Simulations can be set up and executed offline using a local database, then the results uploaded to a remote database when finished.

  • Accessing databases of different styles makes it possible to use the specific advantages of each unique infrastructure.

Interactions with databases are managed using Database objects. Each Database object is initialized by specifying the style of database and any access terms. The access settings for any database can be saved to iprPy’s settings, allowing them to be retrieved later using a simple name string. Saving the database settings in this way also lets the terminal commands know which database to interact with.

NOTE: All access settings are stored in the iprPy settings file as unencrypted text. If your database requires a password to access, only save it to the settings if no one else can access the directory containing the settings file. If the database requires a password and the password is not set, then a prompt will ask for the password every time a Database object is created.
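Before saving a password, it may be worth checking who can read the settings location. The following is a minimal sketch for POSIX systems only. The helper name readable_only_by_owner is hypothetical and not part of iprPy, and a temporary file stands in for the real settings file or directory.

```python
import os
import stat
import tempfile


def readable_only_by_owner(path):
    """Return True if no group/other permission bits are set (POSIX semantics)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstrate on a temporary file standing in for a settings file
fd, demo_file = tempfile.mkstemp()
os.close(fd)

# Owner read/write only
os.chmod(demo_file, 0o600)
private_result = readable_only_by_owner(demo_file)

# Group and others may read
os.chmod(demo_file, 0o644)
shared_result = readable_only_by_owner(demo_file)

os.remove(demo_file)
```

The same function can be pointed at a directory path. Permission bits are only part of the picture, since parent directories, ACLs, and backups can also expose the file.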

2.1. Database styles

iprPy currently supports the following database styles. This list could be expanded to support any other database infrastructures that primarily rely on JSON- or XML-like records.

  • local style databases exist simply as directories on the local machine containing a collection of JSON or XML files and supporting CSV cache files. The advantages of the local style are that it requires no additional setup or installations and the included files can be directly explored in a text editor. The disadvantages are that performance degrades when accessing a record style that contains a large number of entries and that the directory has to be directly accessible by the local operating system.

  • mongo style databases interact with MongoDB instances. The advantages of interacting with a MongoDB database are improved performance and the ability to interact with both local and remote databases. The disadvantages are that a MongoDB instance must be set up and accessible, and that exploring the contained records requires special software or that the records be downloaded. Individual installs of MongoDB and the supporting software are free and relatively easy, so it is recommended to use a mongo style database if you plan on performing a large number of calculations.

  • cdcs style databases interact with CDCS instances. The https://potentials.nist.gov database is a CDCS database and contains records for the content that appears on the NIST Interatomic Potentials Repository as well as input settings parameters used by the iprPy calculations. The advantages of a CDCS database are that it allows for public access of stored records through either a website or REST APIs with user-access controls and can automatically render stored records using XSLT. The disadvantages of a CDCS database are that installation is more involved than the other two styles and interactions are slower than MongoDB interactions.
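To make the trade-off of the local style more concrete, here is a rough sketch of a directory-of-JSON-files store. The layout (one folder per record style, one file per record), the record fields, and the find_records helper are assumptions for illustration and do not reproduce iprPy's actual file structure or query code.

```python
import json
import tempfile
from pathlib import Path

# A temporary folder stands in for the database host directory
host = Path(tempfile.mkdtemp())
style_dir = host / "demo_style"  # hypothetical record style name
style_dir.mkdir()

# Write a few made-up records, one JSON file per record
demo_records = [
    ("calc-1", "finished"),
    ("calc-2", "not calculated"),
    ("calc-3", "finished"),
]
for name, status in demo_records:
    content = {"name": name, "status": status}
    (style_dir / f"{name}.json").write_text(json.dumps(content))


def find_records(style_dir, **conditions):
    """Scan every JSON file in the folder and keep records matching all conditions."""
    matches = []
    for path in sorted(Path(style_dir).glob("*.json")):
        record = json.loads(path.read_text())
        if all(record.get(key) == value for key, value in conditions.items()):
            matches.append(record["name"])
    return matches


finished = find_records(style_dir, status="finished")
```

A query written this way has to open every file, so its cost grows with the number of records. That is one plausible reason a file-based store slows down for large record sets, where an indexed database such as MongoDB does not.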

2.2. Command line

$ iprPy list_databases

will display a list of the currently set database names.

$ iprPy database <database_name>

will display the basic information associated with the named database, specifically the database’s style and host path or URL.

$ iprPy set_database <database_name>

allows for database access parameters to be set and assigned to the given name. Prompts will ask for the database style and host. Following that, additional prompts allow for any other access settings to be defined one at a time by first specifying a parameter name and then the value. The additional prompts will end when no parameter name is given.
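The open-ended prompting described above follows a common pattern: ask for a parameter name, then its value, and stop when the name is left blank. The sketch below shows that pattern only. The function name, prompt wording, and example parameter names are made up and are not taken from iprPy.

```python
def collect_access_parameters(ask):
    """Gather name/value pairs until an empty parameter name is given.

    ask is any callable that takes a prompt string and returns the reply,
    such as the built-in input.
    """
    params = {}
    while True:
        name = ask("Parameter name (leave blank to finish): ").strip()
        if not name:
            break
        params[name] = ask(f"Value for {name}: ")
    return params


# Scripted replies stand in for a user typing at the terminal
replies = iter(["username", "me", "port", "27017", ""])
collected = collect_access_parameters(lambda prompt: next(replies))
```

Passing the prompt function in as an argument keeps the loop testable. In interactive use, input would be supplied instead of the scripted replies. Values gathered this way are strings unless converted afterwards.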

$ iprPy unset_database <database_name>

will delete the saved access settings associated with the database name. Note that this only removes the saved access settings and does not affect the database itself in any way.

2.3. Python

A database object can be initialized with load_database by giving the database style, host, and any extra database-style-specific access parameters. Alternatively, load_database can simply take the name of a previously set database and it will create the database object based on the saved settings.

newdb = iprPy.load_database(style='local', host='/users/me/myDB')
olddb = iprPy.load_database(<database_name>)

The settings operations related to saving database access settings are

iprPy.settings.list_databases

returns the list of names associated with saved database settings.

iprPy.settings.databases

returns a dictionary of database access settings where the keys are the database names and the values are dicts that contain the associated access parameters.

iprPy.settings.set_database(<database_name>, <style>, <host>, **kwargs)

allows for database access settings to be saved under the given database name.

iprPy.settings.unset_database(<database_name>)

removes the saved settings associated with the database name. Note that this only removes the access parameters from the settings file and does not affect the database itself in any way.

2.4. Other Database information

The Database objects in iprPy behave slightly differently from the Database objects of the potentials and atomman packages. This primarily arises from the fact that potentials and atomman predominantly access reference data sets while iprPy is designed to generate and manage calculation data. The main differences can be summed up as:

  • A potentials/atomman Database provides a wrapper around two database locations: a local location and a remote location. By default, the local location is a local-style database and the remote is the CDCS https://potentials.nist.gov/. This provides a convenient means of exploring reference records that could exist either locally or in the official remote database.

  • An iprPy Database interacts with a single specified database location that stores both reference inputs and calculation results.

  • The iprPy Database objects have a “potdb” attribute and a “build_potdb()” method that return a potentials/atomman Database object where the local database is set to the iprPy Database’s location. This gives each iprPy Database the ability to access the convenience methods of the potentials/atomman Database class.

  • The iprPy Databases also contain additional methods associated with managing records for high throughput calculations.

Related to the potentials/atomman Databases, there are a few extra settings parameters that can be changed and controlled. All of these have corresponding set methods and the kim parameters also have unset methods.

  • remote controls the default value of the similarly named parameter used to initialize potentials/atomman Databases and indicates if Database queries will search the remote location.

  • local controls the default value of the similarly named parameter used to initialize potentials/atomman Databases and indicates if Database queries will search the local location.

  • pot_dir_style controls the default value of the similarly named parameter used to initialize potentials/atomman Databases and indicates what the pot_dir values of retrieved LAMMPS potentials are set to.

  • kim_api_directory is used to determine which KIM models are included in any returned queries of LAMMPS potentials. By default, the query operation will use the kim-api-collections-management script in the specified directory to determine the list of installed KIM models.

  • kim_models_file is used to specify which KIM models are included in any returned queries of LAMMPS potentials. By default, the query operation will read the list of KIM models from this file. This is an alternative to the kim_api_directory parameter and makes it possible to explore KIM models that may be installed on other resources.

3. Define run directories

A run directory is a local directory used when performing high throughput operations. Notably, it is where the folders for calculations to be performed are prepared and later executed by a runner.

Using multiple run directories is encouraged, as it allows for separate pools of calculations that may be associated with different databases or that use different numbers of processors. While there are no inherent links between a database and a run directory, it is good practice to use each run directory with a single database. This is because running the calculations requires that the associated database contain an incomplete record for every calculation being performed in the run directory.
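The requirement that every calculation in a run directory has a matching incomplete record can be pictured as a simple consistency check. The sketch below assumes that each calculation folder is named after its record, which is an assumption made for illustration. The folders_without_records helper is hypothetical and is not an iprPy function.

```python
import tempfile
from pathlib import Path


def folders_without_records(run_directory, incomplete_record_names):
    """Return names of calculation folders that have no matching incomplete record."""
    known = set(incomplete_record_names)
    return sorted(
        entry.name
        for entry in Path(run_directory).iterdir()
        if entry.is_dir() and entry.name not in known
    )


# A temporary folder stands in for a run directory with three prepared calculations
run_directory = Path(tempfile.mkdtemp())
for name in ["calc-a", "calc-b", "calc-c"]:
    (run_directory / name).mkdir()

# Pretend the database only knows about two of the three prepared calculations
orphans = folders_without_records(run_directory, ["calc-a", "calc-c"])
```

Mixing calculations from several databases in one run directory would produce "orphans" like this from the point of view of any single database, which is why pairing each run directory with one database is the safer habit.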

3.1. Command line

$ iprPy list_run_directories

will display a list of the currently set run directory names.

$ iprPy run_directory <run_directory_name>

will display the path associated with the named run directory.

$ iprPy set_run_directory <run_directory_name>

allows for a run directory to be set according to the given run directory name. A prompt will ask for the path to associate with the name.

$ iprPy unset_run_directory <run_directory_name>

will delete the saved settings associated with the run directory name.

3.2. Python

run_directory = iprPy.load_run_directory(<run_directory_name>)

will retrieve the run directory path associated with the given run directory name.

iprPy.settings.list_run_directories

returns a list of the run directory names that are currently saved to the settings.

iprPy.settings.run_directories

returns a dict where the keys are the saved names of the set run directories and the values are the associated paths.

iprPy.settings.set_run_directory(<run_directory_name>, <path>)

saves a given run directory path under the specified name.

iprPy.settings.unset_run_directory(<run_directory_name>)

deletes the saved settings associated with the run directory name.