Getting Started
GaMPEN is written in Python and relies on the PyTorch deep learning library to perform all of its tensor operations.
Installation
Training and inference with GaMPEN require Python 3.7+. Trained GaMPEN models can be run on a CPU to perform inference, but training a model requires access to a CUDA-enabled GPU for reasonable training times.
Create a new conda environment with Python 3.7+. Extensive instructions on how to create a conda environment can be found here. Of course, you could use any other method of creating a virtual environment, but we will assume you are using conda for the rest of this guide.
conda create -n gampen python=3.7
Activate the new environment
conda activate gampen
Navigate to the directory where you want to install GaMPEN and then clone this repository with
git clone https://github.com/aritraghsh09/GaMPEN.git
Navigate into the root directory of GaMPEN
cd GaMPEN
Install all the required dependencies with
make requirements
To confirm that the installation was successful, run
make check
It is okay if there are some warnings or some tests are skipped. The only thing you should look out for is errors produced by the make check command.
If you get an error about specific libcudas libraries being absent while running make check, this probably means that you don’t have the appropriate CUDA and cuDNN versions installed for the PyTorch version being used by GaMPEN. See below for more details about GPU support.
GPU Support
GaMPEN can make use of multiple GPUs while training if you pass in the appropriate arguments to the train.py script.
To check whether GaMPEN is able to detect GPUs, type python into the command line from the root directory and run the following commands:
from ggt.utils.device_utils import discover_devices
discover_devices()
The output should be cuda if GaMPEN can detect a GPU. If the output is cpu, then GaMPEN couldn’t find a GPU. This could be because you don’t have a GPU, or because you haven’t installed the appropriate CUDA and cuDNN versions.
If you are using an NVIDIA GPU, then you can use this link for more details about specific CUDA and cuDNN versions that are compatible with different PyTorch versions. To check the version of PyTorch you are using, type python into the command line and then run the following code-block:
import torch
print(torch.__version__)
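If you want to cross-check which CUDA and cuDNN versions your PyTorch build was compiled against, the standard PyTorch introspection calls below should help. This is just a minimal diagnostic sketch using stock PyTorch APIs:
import torch

print(torch.__version__)               # installed PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built with (None for CPU-only builds)
print(torch.backends.cudnn.version())  # cuDNN version, if available
print(torch.cuda.is_available())       # True if PyTorch can see a usable CUDA GPU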
Quickstart
The core parts of the GaMPEN ecosystem are:
1. Placing your data in a specific directory structure
2. Using the GaMPEN/ggt/data/make_splits.py script to generate train/devel/test splits
3. Using the GaMPEN/ggt/train/train.py script to train a GaMPEN model
4. Using the MLFlow UI to monitor your model during and after training
5. Using the GaMPEN/ggt/modules/inference.py script to perform predictions using the trained model
6. Using the GaMPEN/ggt/modules/result_aggregator.py script to aggregate the predictions into an easy-to-read pandas data-frame
Attention
We highly recommend going through all our Tutorials to get an in-depth understanding of how to use GaMPEN, and an overview of all the steps above.
Here, we provide a quick-and-dirty demo to get you started training your first GaMPEN model! This section is intentionally short and light on explanation.
Data preparation
Let’s download some simulated Hyper Suprime-Cam (HSC) images from the Yale servers. To do this, run the following from the root directory of this repository:
make demodir=./../hsc hsc_demo
This should create a directory called hsc at the specified demodir path with the following components:
- hsc
    - info.csv -- file names of the training images with labels
    - cutouts/ -- 67 images to be used for this demo
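To get a feel for the labels, you can peek at info.csv with pandas. This is just an optional sanity check; the exact label columns depend on the demo data, but the bt column is the target used in the next step:
import pandas as pd

# Load the demo label file (path assumes the demodir used above)
info = pd.read_csv("../hsc/info.csv")
print(info.columns.tolist())  # should include file_name and label columns such as bt
print(info.head())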
Now, let’s split the data into train, devel, and test sets. To do this, run
python ./ggt/data/make_splits.py --data_dir=./../hsc/ --target_metric='bt'
This will create another folder called splits within hsc with the different data-splits for training, devel (validation), and testing.
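You can verify that the split files were generated by listing the new directory; a minimal check using Python’s standard library:
from pathlib import Path

# List the generated split files (path assumes the demodir used above)
for split_file in sorted(Path("../hsc/splits").glob("*.csv")):
    print(split_file.name)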
Running the trainer
Run the trainer with
python ggt/train/train.py \
--experiment_name='demo' \
--data_dir='./../hsc/' \
--split_slug='balanced-dev2' \
--batch_size=16 \
--epochs=2 \
--lr=5e-7 \
--momentum=0.99 \
--crop \
--cutout_size=239 \
--target_metrics='custom_logit_bt,log_R_e,log_total_flux' \
--repeat_dims \
--no-nesterov \
--label_scaling='std' \
--dropout_rate=0.0004 \
--loss='aleatoric_cov' \
--weight_decay=0.0001 \
--parallel
To list all the possible options along with explanations, head to the Using GaMPEN page or run
python ggt/train/train.py --help
Launching the MLFlow UI
Open a separate shell and activate the virtual environment that the model is training in. Then, run
mlflow ui
Now navigate to http://localhost:5000/ to access the MLFlow UI, which will show you the status of your model training.
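If you prefer to query training runs programmatically instead of through the UI, the MLflow Python client can read the same local tracking store. A quick sketch, assuming MLflow’s default ./mlruns store and the demo experiment name used above:
import mlflow

# Look up the experiment created by the demo training run
# (get_experiment_by_name returns None if the experiment does not exist)
experiment = mlflow.get_experiment_by_name("demo")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
print(runs[["run_id", "status"]])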
MLFlow on a remote machine
First, on the server/HPC, navigate to the directory from where you initiated your GaMPEN run (you can do this on a separate machine as well; only the filesystem needs to be the same). Then execute the following command:
mlflow ui --host 0.0.0.0
The --host option is important to make the MLFlow server accept connections from other machines.
Now, from your local machine, tunnel into port 5000 of the server where you ran the above command.
For example, let’s say you are in an HPC environment where the machine on which you ran the above command is named server1, the login node of your HPC is named hpc.university.edu, and your username is astronomer. Then, to forward the port, type the following command on your local machine:
ssh -N -L 5000:server1:5000 astronomer@hpc.university.edu
If performing the above step without a login node (e.g., a server which has the IP server1.university.edu), you should be able to do
ssh -N -L 5000:localhost:5000 astronomer@server1.university.edu
After forwarding, if you navigate to http://localhost:5000/, you should be able to access the MLFlow UI.
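Once the tunnel is up, you can also point the MLflow Python client at the forwarded port instead of using the browser. A minimal sketch (the exact listing call may vary across MLflow versions):
import mlflow

# The SSH tunnel makes the remote tracking server reachable on localhost:5000
mlflow.set_tracking_uri("http://localhost:5000")
print(mlflow.search_experiments())  # list the experiments on the remote server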
Training on other datasets
1. First create the necessary directories with
mkdir -p (dataset-name)/cutouts
2. Place FITS files in (dataset-name)/cutouts.
3. Provide a file titled info.csv at (dataset-name). This file should have (at least) a column titled file_name (corresponding to the names of the files in (dataset-name)/cutouts), a column titled object_id (with a unique ID for each file in (dataset-name)/cutouts), and one column for each of the variables that you are trying to predict. For example, if you are trying to predict the radius of a galaxy, you would have a column titled radius (see the sketch after this list for an example info.csv).
4. Generate train/devel/test splits with
python ggt/data/make_splits.py --data_dir=data/(dataset-name)/
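As a concrete illustration of the info.csv format described above, here is a minimal sketch built with pandas. The dataset name, file names, and the radius column are purely hypothetical placeholders for your own data:
import pandas as pd

# Hypothetical example: two galaxy cutouts with one target variable (radius)
info = pd.DataFrame({
    "file_name": ["galaxy_0001.fits", "galaxy_0002.fits"],
    "object_id": [1, 2],
    "radius": [3.2, 5.7],  # one column per variable you want to predict
})
info.to_csv("data/my-dataset/info.csv", index=False)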
The make_splits.py script splits the dataset into a variety of splits, and you can choose to use any of these for your analysis. Details of the various splits are mentioned on the Using GaMPEN page.
After generating the splits, the (dataset-name) directory should look like this:
- (dataset-name)
- info.csv
- cutouts/
- splits/
Follow the instructions under Running the trainer.