# Training and evaluating a policy

Training a policy using multi-agent reinforcement learning is as simple as editing a config file.

### Edit the config file to be passed to SocialGym

To ease batch training jobs we created a wrapper class around the training and evaluation code that can be configured via a configuration file (a `.yaml` file). These configuration files are stored in `{PROJECT ROOT}/config_runner/configs`. ConfigRunner allows you to train and evaluate a policy using one of these files via

```shell
python config_runner/run.py -c {path_to_config}
```

where `{path_to_config}` is the relative path from `{PROJECT ROOT}/config_runner/configs` to a specific config file.

You can run batch training jobs either by separating each unique configuration file with whitespace and the `-c` flag, or by using the `-f` flag (meaning `folder`) and pointing it at a folder in the `{PROJECT ROOT}/config_runner/configs` directory.

An example of a config is shown below:

```yaml
{
  "num_agents": [[0, 3], [35, 4], [70, 5]],
  "eval_num_agents": [3, 4, 5, 7, 10],
  "train_length": 250000,
  "ending_eval_trials": 25,
  "eval_frequency": 0,
  "intermediate_eval_trials": 25,
  "policy_algo_sb3_contrib": false,
  "policy_algo_name": "PPO",
  "policy_name": "MlpPolicy",
  "policy_algo_kwargs": {"n_steps": 4096},
  "monitor": false,
  "experiment_names": ["envs_door"],
  "run_name": "door_ao",
  "run_type": "AO",
  "device": "cuda:0",
  "other_velocities_obs": true,
  "agent_velocity_obs": true,
  "agent_velocity_ignore_theta": false,
  "other_velocities_ignore_theta": false,
  "other_poses_ignore_theta": false,
  "agent_pose_ignore_theta": false,
  "entropy_constant_penalty": -100000,
  "entropy_constant_penalty_only_those_that_did_not_finish": true,
  "timelimit": true,
  "timelimit_threshold": 3000
}
```

### The main training loop using the config file you just edited

Each attribute in the yaml configuration matches an argument passed into the `run` function in `{PROJECT_ROOT}/src/config_run.py`.
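Conceptually, this means a config file can be loaded and its fields forwarded to `run` as keyword arguments. The sketch below illustrates that mapping; it assumes PyYAML is available, that `run` accepts these fields directly as keyword arguments, and the config path shown is only an example:

```python
import yaml  # assumes PyYAML is installed

# Assumed import path for this sketch; the run function lives in src/config_run.py.
from src.config_run import run

# Load a ConfigRunner file and forward each attribute as a keyword argument.
with open("config_runner/configs/envs_door/door_ao.yaml") as f:  # example path only
    config = yaml.safe_load(f)

run(**config)
```

In practice you will normally launch jobs through `config_runner/run.py` rather than calling `run` yourself; the sketch is only meant to show how the config attributes line up with the function's arguments.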
Social Gym 2.0 uses a familiar Gym-like training loop with some important deviations. An example of training a new policy using a config file is shown below.

```python
import supersuit as ss
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecMonitor, VecNormalize

# GraphNavScenario, Observer, Rewarder, RosSocialEnv, the observation and reward
# components, and the custom wrappers below are imported from Social Gym's modules.

# 1.) Create the Scenario
scenario = GraphNavScenario('envs/scenario/hallway')

# 2.) Create the Observer from modular, customizable Observations
observations = [
    AgentsPose(ignore_theta=True),
    OtherAgentObservables(ignore_theta=True),
    CollisionObservation(),
    SuccessObservation()
]
observer = Observer(observations)

# 3.) Create the Rewarder with a sparse goal reward and a penalty term that scales over the course of training.
rewards = [
    Success(weight=100),
    LinearWeightScheduler(Collisions(), duration=10_000)
]
rewarder = Rewarder(rewards)

# 4.) Create the base environment
num_agents = 7
env = RosSocialEnv(observer, rewarder, scenario, num_agents=num_agents)

# 5.) Custom wrappers
env = EntropyEpisodeEnder(env)
env = NewScenarioWrapper(env, new_scenario_episode_frequency=1,
                         plans=num_agents if isinstance(num_agents, list) else [0, num_agents])

# 6.) Wrappers that convert the PettingZoo environment into a Stable Baselines3 vectorized environment
env = ss.black_death_v3(env)
env = ss.pad_observations_v0(env)
env = ss.pad_action_space_v0(env)
env = ss.pettingzoo_env_to_vec_env_v1(env)
env.black_death = True
env = ss.concat_vec_envs_v1(env, 1, num_cpus=1, base_class='stable_baselines3')

# 7.) Stable Baselines3 normalization and monitoring wrappers
env = VecNormalize(env, norm_reward=True, norm_obs=True, clip_obs=10.)
env = VecMonitor(env)

# 8.) Standard Gym interfacing for training and stepping
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10_000)

# 9.) Stepping through the environment with a trained policy.
obs = env.reset()
while env.agents:
    actions, _states = model.predict(obs)
    obs, rewards, terminations, infos = env.step(actions)
```
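After training, you will usually want to persist both the policy and the `VecNormalize` statistics so that later evaluation uses the same observation and reward normalization seen during training. The following is a minimal sketch using standard Stable Baselines3 APIs (`PPO.save`/`PPO.load` and `VecNormalize.save`/`VecNormalize.load`); the file names are illustrative, not part of any SocialGym config.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecNormalize

# `env` is the VecMonitor from step 7; its .venv attribute is the VecNormalize wrapper.
vec_normalize = env.venv
model.save("door_ao_policy")                    # illustrative file name
vec_normalize.save("door_ao_vecnormalize.pkl")  # running obs/reward statistics

# Later, restore the statistics onto the underlying vectorized env (step 6) and
# reload the policy before evaluating.
eval_env = VecNormalize.load("door_ao_vecnormalize.pkl", vec_normalize.venv)
eval_env.training = False     # freeze the running statistics during evaluation
eval_env.norm_reward = False  # report unnormalized rewards at evaluation time
model = PPO.load("door_ao_policy", env=eval_env)
```

Restoring the statistics matters because a policy trained on normalized observations will behave poorly if it is evaluated against raw, unnormalized ones.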