# Getting Started

GaMPEN is written in Python and relies on the [PyTorch](https://pytorch.org/) deep learning library to perform all of its tensor operations.

## Installation
Training and inference for GaMPEN require Python 3.10+. Trained GaMPEN models can be run on a CPU to perform inference, but training a model requires access to a CUDA-enabled GPU for reasonable training times.

1. Create a new conda environment with Python 3.10. Detailed instructions on creating a conda environment can be found [here](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands). You could use any other method of creating a virtual environment, but we will assume you are using conda for the rest of this guide.
```bash
conda create -n gampen python=3.10
```
2. Activate the new environment
```bash
conda activate gampen
```
3. Navigate to the directory where you want to install GaMPEN and then clone this repository with
```bash
git clone https://github.com/aritraghsh09/GaMPEN.git
```
4. Navigate into the root directory of GaMPEN 
```bash
cd GaMPEN
```
5. Install all the required dependencies with
```bash
make requirements
```
6. To confirm that the installation was successful, run
```bash
make check
```
It is okay if there are some warnings or some tests are skipped. The only thing you should look out for is errors produced by the `make check` command.

:::{tip}
If you get an error about specific `libcudas` libraries being absent while running `make check`, this is probably because the appropriate `CUDA` and `cuDNN` versions are not installed for the PyTorch version used by GaMPEN. See [below](#gpu-support) for more details about GPU support.
:::
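
If installation fails early, one thing worth ruling out is the interpreter version in the active environment. The following is a minimal sketch that checks it against the Python 3.10+ requirement stated above; the helper name `meets_minimum` is hypothetical and not part of GaMPEN.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 10)):
    """Return True if the given interpreter version is at least `minimum`."""
    # Compare only (major, minor); the patch level does not matter here.
    return tuple(version_info[:2]) >= minimum


status = "OK" if meets_minimum() else "too old for GaMPEN"
print("Python", sys.version.split()[0], status)
```

Run this with the `gampen` environment activated, so that it reports the interpreter GaMPEN will actually use.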

## GPU Support

GaMPEN can make use of multiple GPUs while training if you pass in the appropriate arguments to the `train.py` script.

To check whether GaMPEN can detect your GPUs, type `python` into the command line from the root directory and run the following commands:
```python
from ggt.utils.device_utils import discover_devices
discover_devices()
```
The output should be `cuda` if GaMPEN can detect a GPU.

If the output is `cpu` then GaMPEN couldn't find a GPU. This could be because you don't have a GPU, or because you haven't installed the appropriate CUDA and cuDNN versions.

If you are using an NVIDIA GPU, see [this link](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for details about the specific CUDA and cuDNN versions that are compatible with different PyTorch versions. To check the version of PyTorch you are using, type `python` into the command line and then run the following code block:

```python
import torch
print(torch.__version__)
```
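
When the device check reports `cpu`, it can help to see what PyTorch itself knows about CUDA and cuDNN. The sketch below collects that information using standard PyTorch attributes; the function name `describe_torch_gpu_support` is hypothetical and not part of GaMPEN, and the block falls back gracefully if PyTorch is not installed.

```python
def describe_torch_gpu_support():
    """Return a dictionary summarising the local PyTorch/CUDA setup."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing from the active environment.
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # True only if PyTorch can actually use a GPU on this machine.
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        "cudnn_version": torch.backends.cudnn.version(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_gpu_support().items():
        print(f"{key}: {value}")
```

If `cuda_build` is `None`, the installed PyTorch is a CPU-only build. If `cuda_build` is set but `cuda_available` is `False`, the driver or CUDA installation on the machine is the more likely culprit.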


## Quickstart

The core steps involved in running GaMPEN are:

1. Placing your data in a specific directory structure
2. Using the `GaMPEN/ggt/data/make_splits.py` script to generate train/devel/test splits
3. Using the `GaMPEN/ggt/train/train.py` script to train a GaMPEN model
4. Using the MLFlow UI to monitor your model during and after training
5. Using the `GaMPEN/ggt/modules/inference.py` script to make predictions with the trained model
6. Using the `GaMPEN/ggt/modules/result_aggregator.py` script to aggregate the predictions into an easy-to-read pandas data-frame

:::{attention}
We strongly recommend going through our [Tutorials](Tutorials.md) and [Using GaMPEN](Using_GaMPEN.md) pages to get an in-depth understanding of how to use GaMPEN, and an overview of all the steps above.

Here, we provide a quick-and-dirty demo to get you started training your first GaMPEN model. This section is intentionally short, without much explanation.
:::

### Data preparation
Let's download some simulated Hyper Suprime-Cam (HSC) images from the Yale servers. To do this, run the following command from the root directory of this repository:

```bash
make demodir=./../hsc hsc_demo
```

This should create a directory called `hsc` at the specified `demodir` path with the following components

```text
- hsc
  - info.csv -- file names of the training images with labels
  - cutouts/ -- 67 images to be used for this demo
```

Now, let's split the data into train, devel, and test sets. To do this, run
```bash
python ./ggt/data/make_splits.py --data_dir=./../hsc/ --target_metric='bt'
```

This will create another folder called `splits` within `hsc` with the different data-splits for training, devel (validation), and testing.
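
To see what was generated, a short script can list the files in `splits` and count the rows in each. This is only a sketch: it assumes the split files are CSV files with a single header row, and `summarize_splits` is a hypothetical helper, not part of GaMPEN.

```python
import csv
from pathlib import Path


def summarize_splits(data_dir):
    """Map each CSV file under <data_dir>/splits to its number of data rows."""
    splits_dir = Path(data_dir) / "splits"
    if not splits_dir.is_dir():
        return {}

    summary = {}
    for path in sorted(splits_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row (assumed to exist)
            summary[path.name] = sum(1 for _ in reader)
    return summary


if __name__ == "__main__":
    # Path matches the --data_dir used in the demo above.
    for name, n_rows in summarize_splits("./../hsc/").items():
        print(f"{name}: {n_rows} rows")
```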

### Running the trainer
Let's use the data we just downloaded to train a GaMPEN model. To do this, we will
use the `train.py` script:

```bash
python ggt/train/train.py \
  --experiment_name='demo' \
  --data_dir='./../hsc/' \
  --split_slug='balanced-dev2' \
  --batch_size=16 \
  --epochs=2 \
  --lr=5e-7 \
  --momentum=0.99 \
  --crop \
  --cutout_size=239 \
  --target_metrics='custom_logit_bt,ln_R_e_asec,ln_total_flux_adus' \
  --repeat_dims \
  --no-nesterov \
  --label_scaling='std' \
  --dropout_rate=0.0004 \
  --loss='aleatoric_cov' \
  --weight_decay=0.0001 \
  --parallel
```
To list all possible options along with explanations, head to the [Using GaMPEN](Using_GaMPEN.md) page or run
```bash
python ggt/train/train.py --help
```

### Launching the MLFlow UI

Although this is an optional step, GaMPEN comes pre-installed with the [MLFlow UI](https://mlflow.org/) to help you
monitor the models that you are currently training and compare them with models you have trained in the past.

To initialize MLFlow, open a separate shell and _activate the virtual environment_ that the model is training in. Then,
navigate to the directory from where you initiated the training run and execute the following command:

```bash
mlflow ui
```
Now navigate to `http://localhost:5000/` to access the MLFlow UI, which will show you the status of your model training.

:::{warning}
If you are running these commands on a server/remote machine, you will need to follow the additional instructions listed below
to access the MLFlow UI.
:::

#### MLFlow on a remote machine

First, on the server/HPC system, open a shell and _activate the virtual environment_ that the model is training in. Then navigate to the directory from where you initiated your GaMPEN run (you can do this on a separate machine as well; only the filesystem needs to be the same) and execute the following command:

```bash
mlflow ui --host 0.0.0.0
```

The `--host` option is needed to make the MLFlow server accept connections from other machines.

Now, from your local machine, tunnel into port `5000` of the server where you ran the above command. After forwarding, you should be able to access the MLFlow UI by navigating to `http://localhost:5000/`.

:::{tip}
For example, suppose you are working in an HPC environment where the machine on which you ran the above command is named `server1`, the login node of your HPC system is named `hpc.university.edu`, and your username is `astronomer`. Then, to establish port-forwarding, type the following command on your local machine:

```bash
ssh -N -L 5000:server1:5000 astronomer@hpc.university.edu
```

If you are connecting without a login node (e.g., directly to a server with the hostname `server1.university.edu`), you should be able to
establish port-forwarding simply with:

```bash
ssh -N -L 5000:localhost:5000 astronomer@server1.university.edu
```
:::
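
If the MLFlow page does not load after setting up the tunnel, a quick first check is whether anything is accepting connections on the forwarded local port. The following sketch uses only the Python standard library; `port_is_open` is a hypothetical helper name, and the default port `5000` simply matches the commands above.

```python
import socket


def port_is_open(host="localhost", port=5000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError if the port is closed or unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:5000")
    else:
        print("Nothing reachable on localhost:5000; check the SSH tunnel")
```

A successful connection only shows that the port is reachable, not that MLFlow is the service answering on it.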


### Training on other datasets
In the [Running the trainer](#running-the-trainer) section, we demonstrated how to train a GaMPEN model on the demo HSC
dataset. Below, we outline the core steps involved in training a model on your own data:

1. Create the necessary directory structure with:

```bash
mkdir -p dataset-name/cutouts
```
where `dataset-name` can be any name of your choosing.

2. GaMPEN expects input images to be in the `.fits` format, centered on the galaxy of interest. Make
same-sized individual cutouts for all objects in your dataset, and place these files in `dataset-name/cutouts/`.

3. Place a file titled `info.csv` inside the `dataset-name` directory. This file should have (at least) a column titled `file_name` (corresponding to the names of the files in `dataset-name/cutouts`), a column titled `object_id` (with a unique ID for each file in `dataset-name/cutouts/`) and one column each for the parameters that you are trying to predict. For example, if you are trying to predict the radius and magnitude of a galaxy, you would have two columns titled `radius` and `magnitude`. 

:::{note}
Besides the `file_name` and `object_id` columns, all other columns in `info.csv` can be named however you choose.
There are also no restrictions on additional columns being present in `info.csv`.
:::

4. Next, separate your dataset into train, devel, and test splits with the following command:

```bash
python ggt/data/make_splits.py --data_dir=/dataset-name/
```
:::{attention}
You should provide the full path of the `dataset-name` directory to the `data_dir` argument.
:::

The `make_splits.py` file splits the dataset according to a set of pre-determined fractions and you can choose to use any of these for your analysis. Details of the various splits are mentioned on the [Using GaMPEN](Using_GaMPEN.md#make-splits) page.

After generating the splits, the `dataset-name` directory should look like this:
```
- dataset-name
    - info.csv
    - cutouts/
    - splits/
```

:::{tip}
To change the fractions (of train/devel/test data) in the various splits, alter the `split_types` dictionary in `make_splits.py`.
:::

5. Follow the instructions in [Running the trainer](#running-the-trainer).
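
Before generating splits on a new dataset, it may be worth checking that the directory matches the layout described in the steps above. The sketch below checks for `info.csv`, the `file_name` and `object_id` columns, unique `object_id` values, and the presence of each listed cutout. The function `check_dataset` is a hypothetical helper, not part of GaMPEN, and it does not inspect the contents of the `.fits` files.

```python
import csv
from pathlib import Path


def check_dataset(data_dir):
    """Return a list of problems found in a GaMPEN-style dataset directory."""
    data_dir = Path(data_dir)
    info_path = data_dir / "info.csv"
    cutouts_dir = data_dir / "cutouts"
    problems = []

    if not info_path.is_file():
        return [f"missing {info_path}"]
    if not cutouts_dir.is_dir():
        problems.append(f"missing {cutouts_dir}")

    with info_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        rows = list(reader)

    # The two columns the steps above describe as required.
    for required in ("file_name", "object_id"):
        if required not in columns:
            problems.append(f"info.csv has no '{required}' column")
    if problems:
        return problems

    seen_ids = set()
    for row in rows:
        # Every object_id should be unique.
        if row["object_id"] in seen_ids:
            problems.append(f"duplicate object_id: {row['object_id']}")
        seen_ids.add(row["object_id"])
        # Every listed file should exist in cutouts/.
        if not (cutouts_dir / row["file_name"]).is_file():
            problems.append(f"cutout not found: {row['file_name']}")
    return problems


if __name__ == "__main__":
    issues = check_dataset("dataset-name")
    print("\n".join(issues) if issues else "dataset layout looks consistent")
```

An empty list means the layout is consistent with the description above; it does not guarantee that the images are same-sized or correctly centered.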