Getting Started
GaMPEN is written in Python and relies on the PyTorch deep learning library to perform all of its tensor operations.
Installation
Training and inference for GaMPEN requires Python 3.7 or 3.8. Trained GaMPEN models can be run on a CPU to perform inference, but training a model requires access to a CUDA-enabled GPU for reasonable training times.
Create a new conda environment with Python 3.7 or 3.8. Extensive instructions on how to create a conda enviornment can be found here. Of course, you could use any other method of creating a virtual environment, but we will assume you are using conda for the rest of this guide.
conda create -n gampen python=3.7
Activate the new environment
conda activate gampen
Navigate to the directory where you want to install GaMPEN and then clone this repository with
git clone https://github.com/aritraghsh09/GaMPEN.git
Navigate into the root directory of GaMPEN
cd GaMPEN
Install all the required dependencies with
make requirements
To confirm that the installation was successful, run
make check
It is okay if there are some warnings or some tests are skipped. The only thing you should look out for is errors produced by the make check
command.
Tip
If you get an error about specific libcudas
libraries being absent while running make check
, this has probably to do with the fact that you don’t have the appropriate CUDA
and cuDNN
versions installed for the PyTorch version being used by GaMPEN. See below for more details about GPU support.
GPU Support
GaMPEN can make use of multiple GPUs while training if you pass in the appropriate arguments to the train.py
script.
To check whether the GaMPEN is able to detect GPUs, type python
into the command line from the root directory and run the following command:
from ggt.utils.device_utils import discover_devices
discover_devices()
The output should be cuda
if GaMPEN can detect a GPU.
If the output is cpu
then GaMPEN couldn’t find a GPU. This could be because you don’t have a GPU, or because you haven’t installed the appropriate CUDA and cuDNN versions.
If you are using an NVIDIA GPU, then you can use this link for more details about specific CUDA and cuDNN versions that are compatible with different PyTorch versions. To check the version of PyTorch you are using, type python
into the command line and then run the following code-block:
import torch
print(torch.__version__)
Quickstart
The core steps involved in running GaMPEN include :-
Placing your data in a specific directory structure
Using the
GaMPEN/ggt/data/make_splits.py
script to generate train/devel/test splitsUsing the
GaMPEN/ggt/train/train.py
script to train a GaMPEN modelUsing the MLFlow UI to monitor your model during and after training
Using the
GaMPEN/ggt/modules/inference.py
script to perform predictions using the trained model.Using the
GaMPEN/ggt/modules/result_aggregator.py
script to aggregate the predictions into an easy-to-read pandas data-frame.
Attention
We strongly recommend going through our Tutorials and Using GaMPEN pages to get an in-depth understanding of how to use GaMPEN, and an overview of all the steps above.
Here, we provide a quick-and-dirty demo to just get you started training your 1st GaMPEN model! This section is intentionally short, without much explanation.
Data preparation
Let’s download some simulated Hyper Suprime-Cam (HSC) images from the Yale servers. To do this run from the root directory of this repository:
make demodir=./../hsc hsc_demo
This should create a directory called hsc
at the specified demodir
path with the following components
- hsc
- info.csv -- file names of the trianing images with labels
- cutouts/ -- 67 images to be used for this demo
Now, let’s split the data into train, devel, and test sets. To do this, run
python ./ggt/data/make_splits.py --data_dir=./../hsc/ --target_metric='bt'
This will create another folder called splits
within hsc
with the different data-splits for training, devel (validation), and testing.
Running the trainer
Let’s use the data we just downloaded to train a GaMPEN model. To do this, we will
use the train.py
script:-
python ggt/train/train.py \
--experiment_name='demo' \
--data_dir='./../hsc/' \
--split_slug='balanced-dev2' \
--batch_size=16 \
--epochs=2 \
--lr=5e-7 \
--momentum=0.99 \
--crop \
--cutout_size=239 \
--target_metrics='custom_logit_bt,ln_R_e_asec,ln_total_flux_adus' \
--repeat_dims \
--no-nesterov \
--label_scaling='std' \
--dropout_rate=0.0004 \
--loss='aleatoric_cov' \
--weight_decay=0.0001 \
--parallel
To list the all possible options along with explanations, head to the Using GaMPEN page or run
python ggt/train/train.py --help
Launching the MLFlow UI
Although, this is an optional step, GaMPEN comes pre-installed with the MLFlow UI to help you monitor the different models that you are currently training and compare these with models you have trained in the past.
To initialize MLFlow, open a separate shell and activate the virtual environment that the model is training in. Then, navigate to the directory from where you initiated the training run and execute the following command:-
mlflow ui
Now navigate to http://localhost:5000/
to access the MLFlow UI which will show you the status of your model training.
Warning
If you are running these commands on a server/remote machine, you will need to follow the additional instructions listed below to access the MLFlow UI.
MLFLow on a remote machine
First, on the server/HPC system, open a shell and activate the virtual environment that the model is training in. Thereafter, navigate to the directory from where you initiated your GaMPEN run (you can do this on separate machine as well – only the filesystem needs to tbe same). Then, execute the following command:-
mlflow ui --host 0.0.0.0
The --host
option is important to make the MLFlow server accept connections from other machines.
Now, from your local machine, tunnel into the 5000
port of the server where you ran the above command. After forwarding, if you navigate to http://localhost:5000/
you should be able to access the MLFlow UI.
Tip
For example, let’s say you are working in an HPC environment, where the machine where you ran the above command is named server1
and the login node to your HPC system is named hpc.university.edu
and you have the username astronomer
. Then, to establish port-forwarding, you should type the following command on your local machine:-
ssh -N -L 5000:server1:5000 astronomer@hpc.university.edu
If performing the above step without a login node (e.g., a server which has the IP server1.university.edu
), you should be able to
establish port-forwarding simply with:-
ssh -N -L 5000:localhost:5000 astronomer@server1.university.edu
Training on other datasets
In Running the trainer section, we demonstrated how to train a GaMPEN model on the demo HSC dataset. Below, we outline the core steps involved in training a model on your own data:-
Create the necessary directory structure with:-
mkdir -p dataset-name/cutouts
where dataset-name
can be any name of your choosing.
GaMPEN expects input images to be in the
.fits
format, centered on the galaxy of interest. Make same-sized individual cutouts for all objects in your dataset; and place these files indataset-name/cutouts/
.Place a file titled
info.csv
inside thedataset-name
directory. This file should have (at least) a column titledfile_name
(corresponding to the names of the files indataset-name/cutouts
), a column titledobject_id
(with a unique ID for each file indataset-name/cutouts/
) and one column each for the parameters that you are trying to predict. For example, if you are trying to predict the radius and magnitude of a galaxy, you would have two columns titledradius
andmagnitude
.
Note
Besides the file_name
and object_id
columns; all other columns in info.csv
can be named according to your choosing.
There are also no limitations on additional columns being present in info.csv
.
Next, separate your dataset into train, devel, and test splits with the following command:-
python ggt/data/make_splits.py --data_dir=/dataset-name/
Attention
You should provide the full path of the dataset-name
directory to the data_dir
argument.
The make_splits.py
file splits the dataset according to a set of pre-determined fractions and you can choose to use any of these for your analysis. Details of the various splits are mentioned on the Using GaMPEN page.
After generating the splits, the dataset-name
directory should look like this:
- dataset_name
- info.csv
- cutouts/
- splits/
Tip
To change the fractions (of train/devel/test data) in the various splits; alter the split_types
dictionary in make_splits.py
Follow the instructions in Running the trainer.