Getting Started

GaMPEN is written in Python and relies on the PyTorch deep learning library to perform all of its tensor operations.

Installation

Training and inference with GaMPEN require Python 3.10+. Trained GaMPEN models can be run on a CPU to perform inference, but training a model requires access to a CUDA-enabled GPU for reasonable training times.

Create a new conda environment with Python 3.10. Extensive instructions on how to create a conda environment can be found here. You could use any other method of creating a virtual environment, but we will assume you are using conda for the rest of this guide.

conda create -n gampen python=3.10

Activate the new environment

conda activate gampen

Navigate to the directory where you want to install GaMPEN and then clone this repository with

git clone https://github.com/aritraghsh09/GaMPEN.git

Navigate into the root directory of GaMPEN

cd GaMPEN

Install all the required dependencies with

make requirements

To confirm that the installation was successful, run

make check

It is okay if there are some warnings or some tests are skipped. The only thing you should look out for is errors produced by the make check command.

Tip

If you get an error about specific libcudas libraries being absent while running make check, this is probably because you don’t have the appropriate CUDA and cuDNN versions installed for the PyTorch version being used by GaMPEN. See below for more details about GPU support.

GPU Support

GaMPEN can make use of multiple GPUs while training if you pass in the appropriate arguments to the train.py script.

To check whether GaMPEN is able to detect GPUs, type python into the command line from the root directory and run the following commands:

from ggt.utils.device_utils import discover_devices
discover_devices()

The output should be cuda if GaMPEN can detect a GPU.

If the output is cpu then GaMPEN couldn’t find a GPU. This could be because you don’t have a GPU, or because you haven’t installed the appropriate CUDA and cuDNN versions.

If you are using an NVIDIA GPU, then you can use this link for more details about specific CUDA and cuDNN versions that are compatible with different PyTorch versions. To check the version of PyTorch you are using, type python into the command line and then run the following code-block:

import torch
print(torch.__version__)
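
When matching CUDA and cuDNN versions, it can also help to see which CUDA version your PyTorch build was compiled against and whether PyTorch itself can see a GPU. The sketch below uses only standard PyTorch attributes; the helper name describe_torch_build is ours and is not part of GaMPEN.

```python
def describe_torch_build():
    """Collect the PyTorch details that matter when matching CUDA/cuDNN versions."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing from the active environment.
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # cuDNN version PyTorch can load (None if unavailable).
        "cudnn_version": torch.backends.cudnn.version(),
        # True only if a working driver and at least one GPU are visible.
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If built_with_cuda is None, the installed PyTorch is a CPU-only build, and installing CUDA alone will not make a GPU visible.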

Quickstart

The core steps involved in running GaMPEN are:

Placing your data in a specific directory structure
Using the GaMPEN/ggt/data/make_splits.py script to generate train/devel/test splits
Using the GaMPEN/ggt/train/train.py script to train a GaMPEN model
Using the MLFlow UI to monitor your model during and after training
Using the GaMPEN/ggt/modules/inference.py script to perform predictions using the trained model
Using the GaMPEN/ggt/modules/result_aggregator.py script to aggregate the predictions into an easy-to-read pandas DataFrame

Attention

We strongly recommend going through our Tutorials and Using GaMPEN pages to get an in-depth understanding of how to use GaMPEN, and an overview of all the steps above.

Here, we provide a quick-and-dirty demo to get you started training your first GaMPEN model. This section is intentionally short, without much explanation.

Data preparation

Let’s download some simulated Hyper Suprime-Cam (HSC) images from the Yale servers. To do this, run the following from the root directory of this repository:

make demodir=./../hsc hsc_demo

This should create a directory called hsc at the specified demodir path with the following components

- hsc
  - info.csv -- file names of the training images with labels
  - cutouts/ -- 67 images to be used for this demo
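
To confirm that the download produced the layout listed above, a short check like the following can be used. It is a minimal sketch using only the Python standard library; the function name summarize_dataset_dir is hypothetical and not part of GaMPEN.

```python
import csv
from pathlib import Path


def summarize_dataset_dir(data_dir):
    """Report how many rows info.csv has and how many files cutouts/ holds."""
    data_dir = Path(data_dir)
    info_path = data_dir / "info.csv"
    cutouts_dir = data_dir / "cutouts"

    n_rows = None  # stays None if info.csv is missing
    if info_path.is_file():
        with info_path.open(newline="") as handle:
            n_rows = sum(1 for _ in csv.DictReader(handle))

    n_cutouts = None  # stays None if cutouts/ is missing
    if cutouts_dir.is_dir():
        n_cutouts = sum(1 for p in cutouts_dir.iterdir() if p.is_file())

    return {"info_rows": n_rows, "cutout_files": n_cutouts}


if __name__ == "__main__":
    print(summarize_dataset_dir("./../hsc"))
```

A value of None for either entry means the corresponding file or directory was not found at the given path.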

Now, let’s split the data into train, devel, and test sets. To do this, run

python ./ggt/data/make_splits.py --data_dir=./../hsc/ --target_metric='bt'

This will create another folder called splits within hsc with the different data-splits for training, devel (validation), and testing.

Running the trainer

Let’s use the data we just downloaded to train a GaMPEN model. To do this, we will use the train.py script:

python ggt/train/train.py \
  --experiment_name='demo' \
  --data_dir='./../hsc/' \
  --split_slug='balanced-dev2' \
  --batch_size=16 \
  --epochs=2 \
  --lr=5e-7 \
  --momentum=0.99 \
  --crop \
  --cutout_size=239 \
  --target_metrics='custom_logit_bt,ln_R_e_asec,ln_total_flux_adus' \
  --repeat_dims \
  --no-nesterov \
  --label_scaling='std' \
  --dropout_rate=0.0004 \
  --loss='aleatoric_cov' \
  --weight_decay=0.0001 \
  --parallel

To list all the possible options along with explanations, head to the Using GaMPEN page or run

python ggt/train/train.py --help
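
If you want to launch several runs that differ in only a few options (for example, the learning rate), it may be convenient to assemble the command in Python. The sketch below simply reproduces the demo flags shown above as an argument list; build_train_command is a hypothetical helper of ours, not something shipped with GaMPEN.

```python
import shlex
import sys


def build_train_command(**overrides):
    """Return the demo train.py invocation as an argument list."""
    options = {
        "experiment_name": "demo",
        "data_dir": "./../hsc/",
        "split_slug": "balanced-dev2",
        "batch_size": 16,
        "epochs": 2,
        "lr": 5e-7,
        "momentum": 0.99,
        "cutout_size": 239,
        "target_metrics": "custom_logit_bt,ln_R_e_asec,ln_total_flux_adus",
        "label_scaling": "std",
        "dropout_rate": 0.0004,
        "loss": "aleatoric_cov",
        "weight_decay": 0.0001,
    }
    # Keyword arguments replace the demo values, e.g. epochs=5.
    options.update(overrides)

    # Boolean switches from the demo command, passed through unchanged.
    switches = ["--crop", "--repeat_dims", "--no-nesterov", "--parallel"]

    cmd = [sys.executable, "ggt/train/train.py"]
    cmd += [f"--{name}={value}" for name, value in options.items()]
    return cmd + switches


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(cmd, check=True)
    # from the repository root to actually start training.
    print(shlex.join(build_train_command(experiment_name="demo-lr", lr=1e-6)))
```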

Launching the MLFlow UI

Although this is an optional step, GaMPEN comes pre-installed with the MLFlow UI to help you monitor the different models that you are currently training and compare these with models you have trained in the past.

To initialize MLFlow, open a separate shell and activate the virtual environment that the model is training in. Then, navigate to the directory from where you initiated the training run and execute the following command:

mlflow ui

Now navigate to http://localhost:5000/ to access the MLFlow UI, which will show you the status of your model training.

Warning

If you are running these commands on a server/remote machine, you will need to follow the additional instructions listed below to access the MLFlow UI.

MLFlow on a remote machine

First, on the server/HPC system, open a shell and activate the virtual environment that the model is training in. Thereafter, navigate to the directory from where you initiated your GaMPEN run (you can do this on a separate machine as well; only the filesystem needs to be the same). Then, execute the following command:

mlflow ui --host 0.0.0.0

The --host option is important to make the MLFlow server accept connections from other machines.

Now, from your local machine, set up port forwarding to port 5000 of the server where you ran the above command. After forwarding, if you navigate to http://localhost:5000/ you should be able to access the MLFlow UI.

Tip

For example, let’s say you are working in an HPC environment where the machine on which you ran the above command is named server1, the login node of your HPC system is named hpc.university.edu, and your username is astronomer. Then, to establish port-forwarding, you should type the following command on your local machine:

ssh -N -L 5000:server1:5000 astronomer@hpc.university.edu

If performing the above step without a login node (e.g., on a server with the address server1.university.edu), you should be able to establish port-forwarding simply with:

ssh -N -L 5000:localhost:5000 astronomer@server1.university.edu
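
If the browser cannot reach the UI after forwarding, it can be useful to check from your local machine whether anything is listening on the forwarded port at all. The following is a minimal sketch using Python's standard socket module; is_port_open is a name we chose for illustration.

```python
import socket


def is_port_open(host="localhost", port=5000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Port 5000 is reachable; try http://localhost:5000/")
    else:
        print("Nothing is answering on port 5000; check the ssh tunnel and mlflow ui")
```

A successful connection only shows that something is listening on the port. It does not confirm that the listener is the MLFlow server.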

Training on other datasets

In the Running the trainer section, we demonstrated how to train a GaMPEN model on the demo HSC dataset. Below, we outline the core steps involved in training a model on your own data:

Create the necessary directory structure with:

mkdir -p dataset-name/cutouts

where dataset-name can be any name of your choosing.

GaMPEN expects input images to be in the .fits format, centered on the galaxy of interest. Make same-sized individual cutouts for all objects in your dataset, and place these files in dataset-name/cutouts/.
Place a file titled info.csv inside the dataset-name directory. This file should have (at least) a column titled file_name (corresponding to the names of the files in dataset-name/cutouts), a column titled object_id (with a unique ID for each file in dataset-name/cutouts/) and one column each for the parameters that you are trying to predict. For example, if you are trying to predict the radius and magnitude of a galaxy, you would have two columns titled radius and magnitude.

Note

Besides the file_name and object_id columns, all other columns in info.csv can be named according to your choosing. There are also no limitations on additional columns being present in info.csv.
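
One way to produce info.csv is with Python's standard csv module, as sketched below. The helper write_info_csv and the use of a running integer as object_id are our assumptions for illustration; any unique identifier works, and the radius and magnitude columns simply follow the example given above.

```python
import csv
from pathlib import Path


def write_info_csv(dataset_dir, labels):
    """Write info.csv with file_name, object_id, and one column per label.

    `labels` maps each cutout file name to a dict of target values,
    e.g. {"galaxy_0.fits": {"radius": 1.2, "magnitude": 21.5}}.
    """
    dataset_dir = Path(dataset_dir)
    # Collect every label name that appears for any object.
    label_names = sorted({name for row in labels.values() for name in row})
    fieldnames = ["file_name", "object_id", *label_names]

    info_path = dataset_dir / "info.csv"
    with info_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        # Assumption: a running integer serves as the unique object_id.
        for object_id, file_name in enumerate(sorted(labels)):
            row = {"file_name": file_name, "object_id": object_id}
            row.update(labels[file_name])
            writer.writerow(row)
    return info_path


if __name__ == "__main__":
    example = {
        "galaxy_0.fits": {"radius": 1.2, "magnitude": 21.5},
        "galaxy_1.fits": {"radius": 0.8, "magnitude": 22.1},
    }
    Path("dataset-name").mkdir(exist_ok=True)
    print(write_info_csv("dataset-name", example))
```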

Next, separate your dataset into train, devel, and test splits with the following command:

python ggt/data/make_splits.py --data_dir=/dataset-name/

Attention

You should provide the full path of the dataset-name directory to the data_dir argument.
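
Before generating the splits, it may be worth checking that info.csv and cutouts/ agree with each other. The sketch below tests only the requirements stated in this guide (the two mandatory columns, unique object IDs, and a cutout file for every row); check_dataset is a hypothetical helper and not part of GaMPEN.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = ("file_name", "object_id")


def check_dataset(data_dir):
    """Return a list of problems found in info.csv and cutouts/ (empty if none)."""
    # Resolve to a full path, matching what data_dir expects.
    data_dir = Path(data_dir).resolve()
    info_path = data_dir / "info.csv"
    if not info_path.is_file():
        return [f"missing {info_path}"]

    with info_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        rows = list(reader)

    problems = [
        f"info.csv lacks column '{name}'"
        for name in REQUIRED_COLUMNS
        if name not in columns
    ]
    if problems:
        return problems

    seen_ids = set()
    for row in rows:
        if not (data_dir / "cutouts" / row["file_name"]).is_file():
            problems.append(f"cutout not found: {row['file_name']}")
        if row["object_id"] in seen_ids:
            problems.append(f"duplicate object_id: {row['object_id']}")
        seen_ids.add(row["object_id"])
    return problems


if __name__ == "__main__":
    for problem in check_dataset("dataset-name"):
        print(problem)
```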

The make_splits.py file splits the dataset according to a set of pre-determined fractions and you can choose to use any of these for your analysis. Details of the various splits are mentioned on the Using GaMPEN page.

After generating the splits, the dataset-name directory should look like this:

- dataset-name
    - info.csv
    - cutouts/
    - splits/

Tip

To change the fractions (of train/devel/test data) in the various splits, alter the split_types dictionary in make_splits.py.
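
To illustrate what splitting by fractions means, here is a generic sketch of dividing a list of objects into train/devel/test subsets. It does not reproduce the actual contents of split_types or the logic in make_splits.py; the fractions and the function split_by_fractions are hypothetical.

```python
import random


def split_by_fractions(items, fractions, seed=0):
    """Shuffle items and cut them into named subsets according to fractions."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    splits = {}
    start = 0
    names = list(fractions)
    for name in names[:-1]:
        end = start + round(fractions[name] * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    # The last subset takes whatever remains so that no item is dropped.
    splits[names[-1]] = shuffled[start:]
    return splits


if __name__ == "__main__":
    # Hypothetical fractions, chosen only for this example.
    example = split_by_fractions(
        range(100), {"train": 0.7, "devel": 0.15, "test": 0.15}
    )
    print({name: len(subset) for name, subset in example.items()})
```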

Follow the instructions in Running the trainer.