An OpenEnv-native wrap of Farama MiniGrid/BabyAI for text-grounded navigation, extended with cross-episodic, LLM-rewritten markdown memory and branch-stable GRPO.
*Banner (Figure 0): Act — Thought: / Action: parsed to Discrete(7), stepped over the OpenEnv WebSocket; Remember — line-limited markdown $M$, rewritten by the same LLM after each episode (Section 7).*

Most LLM benchmarks ask what a model can say. Few ask whether it can act in a grounded, compositional world while curating its own persistent notebook. MiniGridEnv is an OpenEnv-native wrap of Farama's MiniGrid / BabyAI that gives an LLM a 7×7 egocentric world rendered as natural language, natural-language actions ("go forward", "pickup", "turn left"), and BabyAI's ten-stage compositional instruction curriculum from GoToRedBall to BossLevel.
This blog is about the extension. The base environment is a faithful OpenEnv wrap of MiniGrid/BabyAI (existing work, now interoperable). The novel contribution is cross-episodic memory: a line-limited markdown file the agent reads before each action and rewrites at the end of each episode, plus branch-stable GRPO file naming so each parallel rollout chain keeps one stable file to compact across optimizer steps.
Every reward signal is ground-truth arithmetic from the underlying BabyAI bot-verifiable success criterion. There is no LLM judge in the loop.
The falsifiable claims:
Figure 0 (the banner above) encodes the contribution at a glance: the Observe panel matches the text-observation stack in Environment design; the Act panel matches NL actions, parsing, and OpenEnv stepping in the same section; the Remember panel matches cross-episodic memory $M$ in Cross-episodic memory and the training loop in Architecture & training pipeline.
Grounded navigation with compositional language is a load-bearing capability for embodied agents, web agents, and any LLM that must act under an observation budget. BabyAI has been the reference curriculum for this since 2019, but its native interface is a raw gym environment, not a WebSocket contract a GRPO trainer can consume across machines, Docker containers, and HF Spaces with a single code path.
The methodology is transferable. Any text-grounded sequential task with a sparse terminal reward and compositional instructions (web navigation, tool-use, interactive debugging, embodied robotics simulators) fits the same MDP template. Memory is also transferable: line-limited LLM-rewritten markdown is a general mechanism for self-directed state that is not specific to BabyAI.
The environment is engineering-cheap to scale. MiniGrid steps are microseconds; an instance is 1–5 MB; the OpenEnv wrapper sets max_concurrent_envs=256 out of the box. An LLM-backed environment cannot match that density.
The space of "LLMs + text-grounded navigation + memory" sits across three prior buckets. None occupies the cell we target:
| Prior work bucket | What it does | What it does not |
|---|---|---|
| BabyAI / MiniGrid (base) Chevalier-Boisvert et al., arXiv:1810.08272 (ICLR 2019); Farama-Foundation/Minigrid | Compositional language-conditioned navigation as a gym environment with a reference bot and a 10-stage difficulty curriculum | No OpenEnv/WebSocket contract; no text observation; no LLM post-training pipeline; no memory |
| Memory-augmented LLM agents Voyager (arXiv:2305.16291); Reflexion (arXiv:2303.11366); Generative Agents (arXiv:2304.03442) | Cross-episode skill libraries, verbal reflection, structured long-term memory, all prompt-engineered at inference time | No RL post-training; no branch-stable memory semantics under GRPO; not connected to OpenEnv |
| RLVR on language environments DeepSeekMath / GRPO (arXiv:2402.03300); TRL × OpenEnv (TRL docs) | Critic-free RL with verifiable rewards; standard WebSocket env contract and `rollout_func` | No persistent agent state across episodes; no first-class notion of branch-stable rollout chains |
| MiniGridEnv + MiniGridPT (ours) | OpenEnv wrap of MiniGrid/BabyAI + GRPO + cross-episodic LLM-rewritten markdown memory + branch-stable per-chain file naming | Not a human study; memory is text-only (no retrieval index) |
To our knowledge, no prior work combines an OpenEnv-native BabyAI environment with GRPO post-training, line-limited LLM-rewritten cross-episodic memory, and branch-stable memory-file naming that keeps each parallel GRPO chain anchored to a stable file across optimizer steps. The env-contract, memory semantics, and training package are the contribution; MiniGrid/BabyAI are the shoulders we stand on.
Two strictly separated packages. MiniGridEnv (the OpenEnv-compatible environment) and MiniGridPT (the GRPO training client) communicate exclusively over WebSocket. No shared Python imports. The training container is pure-GPU; the environment container is CPU-only.
Each episode:
1. `reset` selects a curriculum level (GoToRedBall … BossLevel), seeds procedural generation, and emits a mission like "go to the red ball" or "open the door on your left, then put the green ball next to the yellow key".
2. The agent replies with Thought: …\nAction: <one of 7 actions>.
3. The action is parsed into the Discrete(7) space; the gym env steps; the wrapper builds the next text observation.
4. Completing the mission terminates the episode with +1.0 (binary reward, the GRPO-friendly default).
5. The LLM rewrites the memory/*.md file for the next episode.

The agent's interface is deliberately minimal: plain Thought:/Action: text, no tool-call protocol, no JSON schema. The training client parses and steps the environment over WebSocket.
The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (MiniGridEnv/env/models.py):
# Action (agent -> env)
class MiniGridAction(Action):
    command: str                    # "go forward", "turn left", "pickup", ...
    thought: Optional[str] = None   # logged for analysis, not executed

# Observation (env -> agent)
class MiniGridObservation(Observation):
    text: str                       # NL description of the 7x7 egocentric view
    mission: str                    # "go to the red ball", ...
    step_idx: int; steps_remaining: int; max_steps: int
    history: list[dict]             # recent step summaries
    level_name: str
    last_action: Optional[str]
    action_success: Optional[bool]
    done: bool; reward: Optional[float]; metadata: dict

# State (hidden from agent; logging / eval only)
class MiniGridState(State):
    level_name, level_difficulty, completed, truncated,
    total_reward, steps_taken, optimal_steps, efficiency_ratio,
    valid_actions, invalid_actions, action_distribution
MiniGrid's raw observation is a (7, 7, 3) numpy grid of (object type, color, door state) with the agent fixed at row=6 col=3 facing "up". env/grid_to_text.py turns that into a layered NL description:
The description opens with a mission line ("Mission: …") and an orientation line ("You are facing {east,south,west,north}."), followed by per-object distance and direction phrases. The internal design note is blunt: "the quality of the text observation is the single biggest lever on training success." Everything else in the environment is a thin layer over the gym loop.
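A minimal sketch of the layered rendering. The object/color tables and phrasing here are illustrative assumptions; the shipped env/grid_to_text.py uses MiniGrid's own OBJECT_TO_IDX / COLOR_TO_IDX constants and is richer.

```python
import numpy as np

# Illustrative lookup tables (assumptions, not the shipped constants).
OBJECTS = {4: "door", 5: "key", 6: "ball", 7: "box"}
COLORS = {0: "red", 1: "green", 2: "blue", 3: "purple", 4: "yellow", 5: "grey"}

def grid_to_text(grid: np.ndarray, mission: str, facing: str) -> str:
    """Render a (7, 7, 3) egocentric grid as layered natural language.
    The agent sits at row=6, col=3 facing 'up' in grid coordinates."""
    lines = [f"Mission: {mission}", f"You are facing {facing}."]
    for row in range(grid.shape[0]):
        for col in range(grid.shape[1]):
            name = OBJECTS.get(int(grid[row, col, 0]))
            if name is None:
                continue  # empty cell / wall / floor: skipped for brevity
            steps_ahead = 6 - row   # cells in front of the agent
            lateral = col - 3       # negative = left, positive = right
            if lateral == 0:
                where = "directly ahead"
            else:
                side = "left" if lateral < 0 else "right"
                where = f"{abs(lateral)} to the {side}"
            color = COLORS.get(int(grid[row, col, 1]), "unknown")
            lines.append(f"A {color} {name}: {steps_ahead} step(s) forward, {where}.")
    return "\n".join(lines)
```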
env/action_parser.py maps natural-language strings to MiniGrid's discrete action index. The same logic is duplicated (intentionally) in MiniGridPT/training/openenv_runtime.py so the PT package remains standalone; a parity test guards the two copies.
| Canonical | Index | Accepted aliases |
|---|---|---|
| turn left | 0 | left |
| turn right | 1 | right |
| go forward | 2 | move forward, forward, ahead, step, walk |
| pickup | 3 | pick up, grab, take, get |
| drop | 4 | release, put down |
| toggle | 5 | open, close, unlock, switch |
| done | 6 | wait, noop, stop |
An unparseable string falls back to go forward, not to done. Rationale: early in training, exploration beats noop; every invalid parse increments a counter so we can watch parse-rate climb with training.
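A condensed sketch of the alias table and fallback logic above. This is not the shipped env/action_parser.py (which also handles prefixes and punctuation); the structure is an assumption based on the table.

```python
# Condensed sketch; the shipped parser also strips punctuation and
# handles "Action:" prefixes.
CANONICAL = {
    "turn left": 0, "turn right": 1, "go forward": 2,
    "pickup": 3, "drop": 4, "toggle": 5, "done": 6,
}
ALIASES = {
    "left": "turn left", "right": "turn right",
    "move forward": "go forward", "forward": "go forward",
    "ahead": "go forward", "step": "go forward", "walk": "go forward",
    "pick up": "pickup", "grab": "pickup", "take": "pickup", "get": "pickup",
    "release": "drop", "put down": "drop",
    "open": "toggle", "close": "toggle", "unlock": "toggle", "switch": "toggle",
    "wait": "done", "noop": "done", "stop": "done",
}

def parse_action(text: str) -> tuple[int, bool]:
    """Map a natural-language action to a Discrete(7) index.
    Returns (index, valid). Unparseable text falls back to "go forward"
    (index 2) so early-training noise explores instead of no-oping."""
    cmd = text.strip().lower()
    cmd = ALIASES.get(cmd, cmd)
    if cmd in CANONICAL:
        return CANONICAL[cmd], True
    return CANONICAL["go forward"], False  # fallback: explore, don't stop
```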
env/levels.py registers the full BabyAI ladder with candidate gym IDs (so minigrid version drift between BabyAI-GoToRedBallGrey-v0 and BabyAI-GoToRedBall-v0 doesn't brick a run):
| Stage | Level | Gym ID | Max steps | Optimal |
|---|---|---|---|---|
| 0 | GoToRedBall | BabyAI-GoToRedBallGrey-v0 | 64 | ~10 |
| 1 | GoToObj | BabyAI-GoToObj-v0 | 64 | ~12 |
| 1 | GoToLocal | BabyAI-GoToLocal-v0 | 64 | ~15 |
| 2 | PickupLoc | BabyAI-PickupLoc-v0 | 64 | ~14 |
| 2 | OpenDoor | BabyAI-OpenDoor-v0 | 64 | ~12 |
| 2 | UnlockLocal | BabyAI-UnlockLocal-v0 | 128 | ~25 |
| 3 | GoTo | BabyAI-GoTo-v0 | 128 | ~30 |
| 3 | PutNextLocal | BabyAI-PutNextLocal-v0 | 128 | ~20 |
| 4 | Synth | BabyAI-Synth-v0 | 128 | ~40 |
| 4 | BossLevel | BabyAI-BossLevel-v0 | 128 | ~80 |
A single Docker container serves every stage. env.reset(level="BossLevel") switches the underlying gym env per-reset. A fix replaced the original del kwargs in reset() with a kwargs.pop("level", None), which is what unlocked single-server curriculum training. Per-level max_steps are defined in our LevelConfig registry (env/levels.py); Synth and BossLevel are capped at 128 steps in this repo so episode length (and vLLM server-mode padding budgets) stay bounded for training.
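The candidate-ID fallback can be sketched as a small helper. `make_first_available` and the injected `make_fn` are illustrative names (in practice `make_fn` would be gymnasium.make; env/levels.py wraps this idea in its LevelConfig registry):

```python
def make_first_available(candidate_ids, make_fn, **kwargs):
    """Try candidate gym IDs in order so minigrid version drift (e.g.
    BabyAI-GoToRedBallGrey-v0 vs BabyAI-GoToRedBall-v0) doesn't brick
    a run. make_fn is injected so the sketch stays dependency-free."""
    last_err = None
    for env_id in candidate_ids:
        try:
            return make_fn(env_id, **kwargs)
        except Exception as err:  # unregistered / renamed gym ID
            last_err = err
    raise RuntimeError(f"no candidate gym ID available: {candidate_ids}") from last_err
```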
Default: binary. +1.0 on completion, 0.0 otherwise. GRPO works best with clean sparse signals. RewardConfig also supports shaped (step penalty + invalid-action penalty) and efficiency (bonus scaled to optimal_steps/steps_taken) modes if a stage stalls.
Let $r_t$ denote the per-step environment reward (binary default). With horizon $T$ (our capped max_steps), mission success at termination gives a single $+1$ spike:

$$R_{\mathrm{env}} \;=\; \sum_{t=1}^{T} r_t \;\in\; \{0, 1\}.$$
In the default mode, $r_t = 0$ for all $t < T$ unless the mission completes early; shaping modes spread signal across steps via RewardConfig in env/reward.py.
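The three modes can be sketched as one dispatch. Coefficient names and values below are illustrative assumptions, not the shipped constants in env/reward.py:

```python
from dataclasses import dataclass

@dataclass
class RewardSketch:
    """Illustrative stand-in for RewardConfig's three modes; the
    coefficient names/values here are assumptions."""
    mode: str = "binary"
    step_penalty: float = 0.01
    invalid_penalty: float = 0.05

    def terminal_reward(self, completed: bool, steps_taken: int,
                        invalid_actions: int, optimal_steps: int) -> float:
        base = 1.0 if completed else 0.0
        if self.mode == "binary":
            return base
        if self.mode == "shaped":
            return (base - self.step_penalty * steps_taken
                         - self.invalid_penalty * invalid_actions)
        if self.mode == "efficiency":
            # bonus scales with closeness to the bot-optimal path length
            return base * (optimal_steps / max(steps_taken, 1)) if completed else 0.0
        raise ValueError(self.mode)
```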
OpenEnv gives us three things that matter for this submission:
- A standard contract: `rollout_func` with typed Pydantic payloads and Gym-style reset/step semantics.
- Concurrency: SUPPORTS_CONCURRENT_SESSIONS=True and max_concurrent_envs=256. DDP ranks can hammer the same Space without cross-talk because each WebSocket session gets a fresh gym.Env instance (MiniGrid is not thread-safe; factory mode is mandatory).
- Deployment: the same code runs locally, in Docker (server/Dockerfile, openenv-base, port 8000), and as a Hugging Face Space during training and evaluation.

No new abstractions were invented. Base types only: EnvClient, Environment, Pydantic Action / Observation / State. Curriculum level, history, and per-episode metrics ride on metadata and state. The environment ships with openenv.yaml, a Dockerfile, and an HF Space.
Critically, MiniGridPT does not import MiniGridEnv. Everything crosses the wire. A MiniGridClient(EnvClient) in MiniGridPT/training/openenv_runtime.py sends plain dicts. This is the architectural lynchpin that lets the training node be pure-GPU and the environment node be CPU-only.
This is the research contribution. The base MiniGrid/BabyAI world is stateless between episodes: each reset gives the agent a fresh procedurally generated room with no persistent side-channel. We add one:
@dataclass
class MemoryConfig:
    enabled: bool = False
    max_lines: int = 100          # line-limit, not token-limit
    memory_dir: str = "./memory"
    agent_id: str = "default"
    branch_stable_memory: bool = False   # see below

    @property
    def memory_path(self) -> Path:
        return Path(self.memory_dir) / f"{self.agent_id}.md"
Among the deliberate design choices, each rejecting a plausible alternative:

- A line budget, not a token budget, surfaced to the model (e.g. (42/100 lines)). The model gets a concrete budget it can reason about.
- Overflow truncation: if a rewrite exceeds max_lines, keep the most-recently-written lines.

The _temporary_vllm_max_tokens helper. Action turns need ~128 tokens (Thought: …\nAction: go forward). The memory rewrite needs ~512 (100 lines at ~5 tokens/line worst case). One global max_completion_length cannot satisfy both. The fix is a context manager:
@contextmanager
def _temporary_vllm_max_tokens(trainer, max_tokens: int):
    vg = trainer.vllm_generation
    prev = vg.max_completion_length
    vg.max_completion_length = max_tokens
    try:
        yield
    finally:
        vg.max_completion_length = prev

# Used both for the 512-token memory rewrite and for the 1-token
# NCCL-padding dummy generates described in the Engineering section.
GRPO runs G parallel completions per prompt, each with its own advantage and gradient contribution. If every slot writes to a uniquely-named file, there's no continuity across optimizer steps, so each memory chain is one episode long. If every slot writes to one shared file, writes race and the signal is mush.
The solution: branch-stable naming rank{R}_br{k}_{base}.md with k = slot_idx % num_generations. The k-th parallel generation maps to a stable file across optimizer steps, so branch k after prompt group P1 is the same file used by branch k after prompt group P2. Each of the G GRPO branches builds its own evolving notebook, which is what gives the model a training signal to compact and summarize episode-to-episode.
Requires per_device_train_batch_size == num_generations (otherwise multiple groups in one step hit the same k and a one-time UserWarning fires). A third scheme (a single shared file across all slots and ranks) is sketched but not landed; it needs a decision about concurrent-writer races.
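The naming scheme itself is a one-liner; this sketch mirrors the rank{R}_br{k}_{base}.md convention described above (the helper name is illustrative):

```python
from pathlib import Path

def branch_stable_path(memory_dir: str, rank: int, slot_idx: int,
                       num_generations: int, base: str = "default") -> Path:
    """Map GRPO slot s to branch k = s % G so the k-th parallel
    generation reuses one stable memory file across optimizer steps."""
    k = slot_idx % num_generations
    return Path(memory_dir) / f"rank{rank}_br{k}_{base}.md"
```

With G = 4, slot 1 in one optimizer step and slot 5 in the next resolve to the same file, which is exactly the cross-step continuity the scheme is after.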
Let $M_e \in \mathcal{M}$ denote the memory file (markdown string) at the start of episode $e$, let $\tau_e$ be the trajectory (observations, parsed actions, outcomes), and let $\pi_\theta^{\mathrm{mem}}$ be the same LLM invoked on the post-episode memory-update prompt. The write is a full rewrite followed by a line-budget projection $\Pi_L(\cdot)$ that keeps the last $L$ lines (here $L = 100$):
$$M_{e+1} = \Pi_L\!\left( \pi_\theta^{\mathrm{mem}}(M_e,\, \tau_e,\, \mathrm{outcome}_e) \right).$$

Branch-stable filenames tie each GRPO branch index $k = s \bmod G$ to a stable path across optimizer steps, for DDP rank $R$, slot index $s$, group size $G = \texttt{num\_generations}$, and basename base (e.g. default):

$$\mathrm{path}(R, s) \;=\; \texttt{rank\{R\}\_br\{k\}\_\{base\}.md}, \qquad k = s \bmod G.$$
This is exactly the Remember panel in Figure 0: the file card is $M_e$ at read time; the post-episode LLM box is $\pi_\theta^{\mathrm{mem}}$; the curved arrow is the next-episode read of $M_{e+1}$.
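The write step above can be sketched in a few lines, with a stand-in `rewrite_fn` for the LLM call (a hypothetical signature; the shipped code fills MEMORY_UPDATE_PROMPT and generates under a 512-token budget):

```python
def project_to_line_budget(memory_text: str, max_lines: int = 100) -> str:
    """Pi_L: keep only the most-recently-written max_lines lines."""
    return "\n".join(memory_text.splitlines()[-max_lines:])

def update_memory(memory: str, trajectory_summary: str, rewrite_fn,
                  max_lines: int = 100) -> str:
    """M_{e+1} = Pi_L(pi_mem(M_e, tau_e, outcome_e)). rewrite_fn stands
    in for the LLM invoked on the post-episode memory-update prompt."""
    return project_to_line_budget(rewrite_fn(memory, trajectory_summary), max_lines)
```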
Can an LLM learn to curate its own persistent, line-budgeted notebook such that cross-episodic memory measurably improves completion rate, and the memory content evolves from random notes into structured strategies as training progresses?
The environment reward is terminal and sparse. Everything else is a small shaping bonus designed to rule out pathological regimes without dominating the signal.
| Component | Range | Source | What it rewards |
|---|---|---|---|
| Env reward (binary) | 0 or +1 | env/reward.py | Mission completed (BabyAI ground-truth success) |
| Format reward | [−0.1, +0.1] | reward_funcs.reward_format | Both Thought: and Action: present (1.0), one (0.5), neither (0.0), rescaled |
| Memory: in-budget | +0.05 | compute_memory_quality_flags | Memory rewrite stayed within max_lines (no truncation) |
| Memory: non-empty | +0.05 | compute_memory_quality_flags | Agent is actually writing something |
| Memory: not-a-dump | −0.05 | memory_looks_like_observation_dump | Penalty if memory is just a copy of the last observation |
Design principle: env reward dominates. Format and memory-quality bonuses are at ±0.1–0.15 scale, intended as training wheels, removable once the model reliably emits structured output (>90% validity) and writes substantive memory.
Let $\tau$ denote an episode trajectory and $M_e, M_{e+1}$ memory before/after the episode. Write $R_{\mathrm{env}} = \sum_t r_t \in \{0,1\}$ for the binary BabyAI success signal, $R_{\mathrm{fmt}}(\tau)$ for the rescaled format score in $[-1,1]$ (mapped to $[-0.1,0.1]$ via $\alpha_{\mathrm{fmt}} = 0.1$ in code), and $R_{\mathrm{mem}}(M_{e+1})$ for the memory-quality shaping used by the trainer. The scalar logged to TRL as env_reward is:

$$R(\tau, M_e, M_{e+1}) \;=\; R_{\mathrm{env}} \;+\; \alpha_{\mathrm{fmt}}\, R_{\mathrm{fmt}}(\tau) \;+\; R_{\mathrm{mem}}(M_{e+1}).$$
With indicator $\mathbf{1}[\cdot]$, line budget $L$, and dump detector $\mathrm{dump}(M)$ (true when memory is effectively a copy of the last observation):
$$R_{\mathrm{mem}}(M) = \beta_{\mathrm{budget}}\,\mathbf{1}\big[\mathrm{lines}(M) \le L\big] + \beta_{\mathrm{ne}}\,\mathbf{1}\big[M \neq \varnothing\big] - \beta_{\mathrm{dump}}\,\mathbf{1}\big[\mathrm{dump}(M)\big],$$

with $\beta_{\mathrm{budget}} = \beta_{\mathrm{ne}} = \beta_{\mathrm{dump}} = 0.05$ as implemented in MiniGridPT (names may differ slightly in code; the ranges in the table above match the shipped constants).
Why memory quality has a negative flag. Without the memory_looks_like_observation_dump penalty, the shortest-path way to collect the non-empty bonus is to paste the last observation into memory. That gives zero cross-episodic signal. The penalty forces the memory to be compressed / abstracted, which is the interesting behavior.
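An illustrative version of the three flags. The shipped compute_memory_quality_flags / memory_looks_like_observation_dump may differ in detail, notably in how "dump" overlap is measured; the 0.8 token-overlap threshold below is an assumption:

```python
def memory_quality_bonus(memory: str, last_observation: str,
                         max_lines: int = 100) -> float:
    """Sketch of the three shaping flags: in-budget, non-empty, not-a-dump."""
    bonus = 0.0
    if memory.strip():
        bonus += 0.05                    # non-empty: agent wrote something
    if len(memory.splitlines()) <= max_lines:
        bonus += 0.05                    # stayed inside the line budget
    # crude dump detector: heavy token overlap with the last observation
    mem_tokens = set(memory.lower().split())
    obs_tokens = set(last_observation.lower().split())
    if obs_tokens and len(mem_tokens & obs_tokens) / max(len(mem_tokens), 1) > 0.8:
        bonus -= 0.05                    # memory is a pasted observation
    return bonus
```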
Two strictly separated packages. MiniGridEnv (OpenEnv environment) and MiniGridPT (GRPO training client) communicate exclusively over WebSocket (no in-process imports).
flowchart LR
subgraph PT ["MiniGridPT (Training)"]
GRPO["GRPOTrainer
TRL 1.0.0"]
RF["rollout_func
(per-episode loop)"]
VLLM["vLLM
colocate/server"]
PARSE["parse_action
(NL -> Discrete(7))"]
MEM["memory/rank{R}_br{k}_default.md
(branch-stable)"]
end
subgraph ENV ["MiniGridEnv (OpenEnv)"]
WS["FastAPI
WebSocket"]
GYM["MiniGrid gym env
(BabyAI level)"]
TEXT["grid_to_text
(7x7 -> NL)"]
REW["Reward
binary +1.0"]
end
GRPO --> RF
RF --> VLLM
VLLM -->|"generate Thought/Action"| PARSE
PARSE -->|"{command, thought}"| WS
WS --> GYM
GYM --> TEXT
TEXT -->|"observation.text"| WS
WS -->|"obs + reward + done"| RF
RF -->|"post-episode rewrite"| MEM
MEM -->|"read at t=0 next episode"| RF
REW --> WS
Figure 1. System architecture. PT never imports env-side types. Memory is a per-branch markdown file owned by the training client, rewritten by the LLM at each episode end.
Training uses GRPO (Group Relative Policy Optimization), a critic-free RL algorithm ideal for terminal-only rewards. We use TRL 1.0.0's rollout_func contract for explicit control over the generate → parse → env.step loop.
Per slot, _rollout_one_episode (MiniGridPT/training/rollout_func.py) runs a complete episode inside one training step:
1. Connect: construct MiniGridClient(base_url=ENV_BASE_URL).sync() and call env.reset(level=LEVEL, seed=…).
2. Turn loop until done or the turn cap: generate with vLLM (max_completion_length=128), append tokens to completion_ids with env_mask=1, parse Thought/Action, call env.step({"command": canonical, "thought": thought}), append the rendered next-observation tokens with env_mask=0.
3. Memory rewrite: fill MEMORY_UPDATE_PROMPT with outcome / steps / current memory / line count / budget, call generate() wrapped in _temporary_vllm_max_tokens(trainer, 512), write to the branch-stable file, append tokens with env_mask=0.

The return dict is the shape TRL's GRPOTrainer consumes:
{
    "prompt_ids": list[list[int]],      # one per slot (fixed initial prompt)
    "completion_ids": list[list[int]],  # full episode (LLM + env user turns)
    "logprobs": list[list[float]],      # from vLLM; zero-filled for env_mask=0
    "env_mask": list[list[int]],        # 1 = LLM token, 0 = env/context token
    "env_reward": list[float],          # binary env reward + memory-quality bonus
}
The env_mask is what lets us mix LLM-authored tokens (eligible for the env-reward term in the GRPO objective) with env-rendered context tokens (visible for KL-to-reference but excluded from the advantage weighting). Without it, the model would be "rewarded" for tokens it didn't generate.
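A toy sketch of how the turn loop interleaves LLM-authored and env-rendered tokens into that shape. The token lists and the -0.5 placeholder logprob are illustrative, not real tokenizer or vLLM output:

```python
def interleave_turns(llm_turns: list[list[int]],
                     env_turns: list[list[int]]) -> dict:
    """Build the completion_ids / env_mask / logprobs triple: mask 1 for
    LLM-authored tokens, mask 0 (zero-filled logprobs) for env tokens."""
    completion_ids: list[int] = []
    env_mask: list[int] = []
    logprobs: list[float] = []
    for i, llm in enumerate(llm_turns):
        completion_ids += llm
        env_mask += [1] * len(llm)
        logprobs += [-0.5] * len(llm)    # placeholder for vLLM logprobs
        if i < len(env_turns):
            env = env_turns[i]
            completion_ids += env
            env_mask += [0] * len(env)
            logprobs += [0.0] * len(env)  # env tokens: zero-filled
    return {"completion_ids": completion_ids,
            "env_mask": env_mask, "logprobs": logprobs}
```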
At episode boundaries, the rollout reads memory $M_e$ into the prompt, rolls out $\tau$ with environment observations rendered as tokens with mask $0$, then applies $\pi_\theta^{\mathrm{mem}}$ to obtain $M_{e+1}$ as in Section 7, the same loop sketched in Figure 0 (Remember).
Let $y_{i,1:T_i}$ be the token sequence for completion $i$ (including env turns), $m_{i,t} \in \{0,1\}$ the env mask, and $\rho_{i,t}(\theta) = \pi_\theta(y_{i,t}\mid y_{i,1:t-1}) / \pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid y_{i,1:t-1})$ the importance ratio on LLM-authored tokens. With clipping threshold $\epsilon$, KL coefficient $\beta_{\mathrm{KL}}$, and group-relative advantage $A_i = (R_i - \mu_R)/\sigma_R$ over $G$ parallel completions sharing a prompt, the masked GRPO-style surrogate we target is:
$$\mathcal{L}_{\mathrm{GRPO}}(\theta) \;=\; -\,\mathbb{E}\left[ \sum_{i=1}^{G} \sum_{t : m_{i,t}=1} \min\!\Big( \rho_{i,t}(\theta)\, A_i,\; \mathrm{clip}\big(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon\big)\, A_i \Big) \right] + \beta_{\mathrm{KL}}\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}).$$

Here $R_i \equiv R(\tau_i, M_e, M_{e+1})$ is the scalar from Section 8 (environment return plus shaping). Tokens with $m_{i,t}=0$ contribute to the KL / context loss path in TRL but not to the clipped policy-gradient term above.
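A numpy sketch of the masked surrogate above (KL term omitted; TRL's actual implementation differs in detail, e.g. token normalization):

```python
import numpy as np

def masked_grpo_surrogate(logp_new, logp_old, env_mask, rewards, eps=0.2):
    """Clipped group-relative surrogate over LLM-authored tokens only.
    Shapes: logp_* and env_mask are (G, T); rewards is (G,)."""
    logp_new, logp_old = np.asarray(logp_new), np.asarray(logp_old)
    mask = np.asarray(env_mask, dtype=float)
    R = np.asarray(rewards, dtype=float)
    adv = (R - R.mean()) / (R.std() + 1e-8)            # group-relative advantage
    ratio = np.exp(logp_new - logp_old)                # per-token importance ratio
    unclipped = ratio * adv[:, None]
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv[:, None]
    per_token = np.minimum(unclipped, clipped) * mask  # env tokens drop out
    return -per_token.sum()                            # loss (to minimize)
```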
We do not include step-by-step episode traces here: the training logs for the current run are not immediately accessible, and the priority for this submission is the mechanism (Figure 0, Sections 7–9) rather than cherry-picked rollouts. Compute for this project is exhausted before we could finish a converged memory ablation; the author is also concurrently submitting LotteryElicitationEnv and ReasoningEconomicsEnv to the same OpenEnv track, so GPU budget is shared across multiple codebases.
The cards below are not verbatim snapshots from a finished training run. They are category placeholders for what we expect to extract once additional compute is available to run memory-structure experiments and to save real memory/rank{R}_br{k}_*.md files at checkpoints. Future work will test alternative memory organizations (Section 11) under that budget.
When a long run exists, we will snapshot memory/rank0_br0_default.md (or branch-stable peers) and categorize content. For now, each panel illustrates a type of content we expect to see at different training phases:
ball is red
i saw a door
step 3 turn left
step 4 go forward
- if the Mission says "go to X", first face X
- turn left/right before go forward if object is
to the side
- on GoToRedBall the ball is usually 1-3 steps away
- UnlockLocal: keys are the same color as doors
- OpenDoor: "toggle" opens closed and locked doors
(if carrying correct key)
- Synth: mission has multiple clauses -> do them
left-to-right as written
- if action_success=False on go forward, there is a
wall/door -> rotate before next step
- pickup with no adjacent object always fails; read
"carrying: nothing" before attempting
Verbatim memory snapshots from a converged or long partial run will replace these placeholders when further compute is available. Until then, the gallery documents the hypothesis space for how $M$ should evolve, not empirical outcomes from the current submission.
The MiniGridPT training package was exercised for correctness (short runs, parser parity, WebSocket stepping, memory file I/O, vLLM colocate and multi-GPU server mode with NCCL-safe padding). We do not report converged learning curves or final completion rates: the policy did not converge on the full curriculum under the available budget, and structured experiments on alternative memory formats are deferred to a follow-up compute cycle. Compute for this line of work is exhausted for the current submission window; the author is concurrently shipping LotteryElicitationEnv and ReasoningEconomicsEnv to the same OpenEnv track, so GPU time is shared across multiple submissions.
Validated in short runs:

- End-to-end rollout_func, with env_mask partitioning LLM-authored vs. env-rendered tokens and per-episode logs persisting to the --output_dir.
- vLLM colocate mode, with MGPT_VLLM_GPU_UTIL tuned to ~0.45–0.65 on 40 GB (see Engineering lessons).
- Memory rewrites under _temporary_vllm_max_tokens(trainer, 512); branch-stable filenames rank{R}_br{k}_default.md persisting across optimizer steps.
- Multi-GPU server mode with fixed-count generate padding (DIST_SERVER_GENERATES_PER_EPISODE) eliminating NCCL desync under variable-length episodes (now bounded by our capped max_steps ≤ 128 per level in env/levels.py).
- Lambda launch chain bootstrap_lambda.sh → preflight_lambda.sh → run_grpo_lambda.sh, with MGPT_* env vars and cadence / metrics callbacks writing metrics_scalars.csv, metrics_events.jsonl, cadence.log, diagnostics_cadence.jsonl.
- Action-parser parity preserving the Discrete(7) contract across both packages.

The environment bundles three baselines (Random, BabyAI BotAgent, and a caller-provided zero-shot completion_fn), all runnable in-process without a GPU. They are not executed at scale in this submission; longer baseline sweeps and GRPO comparisons are explicitly scoped for the next compute allocation.
Because the current run did not converge and memory-structure ablations are outstanding, the table below is the forward-looking experiment matrix for how $M$ might be organized once additional GPU budget is available. Each row states a hypothesis; all rows require additional compute.
| Variant | Hypothesis tested | Notes |
|---|---|---|
| Structured schema (JSON / YAML / fixed markdown sections) | Schema > free-form markdown for stable curation | Requires additional compute |
| Append + periodic compaction | Full-episode rewrite cost limits the learning signal | Requires additional compute |
| Hierarchical (in-episode scratchpad + cross-episode long-term) | Conflating short- and long-term in one file hurts | Requires additional compute |
| Retrieval-indexed (embed notes, top-k by observation) | Linear-file recall fails at scale | Requires additional compute |
| Shared single-file across branches / ranks | Collective memory beats per-branch curation | Shared-memory design TBD; requires additional compute + concurrency design |
| Success-gated writes | Failure episodes poison $M$ | Requires additional compute |
| Variable line budget by level difficulty | Uniform $L$ is too tight for hardest stages | Requires additional compute |
| Dual-memory (policy vs. world knowledge) | Unified $M$ conflates two knowledge types | Requires additional compute |
| Token budget instead of line budget | Line-count is the wrong self-budgeting unit for LLMs | Requires additional compute |
The research question in Section 7 remains the scientific target; this submission's empirical contribution is the validated pipeline and semantics for $M$, not yet a table of win-rates.
Running GRPO + OpenEnv + vLLM on a multi-turn, memory-augmented environment surfaced three categories of structural issues. We document the ones that are general; the next OpenEnv submission is likely to hit each.
In vllm_mode=server, each trainer.vllm_generation.generate() call performs gather_object → all_gather_object → broadcast_object_list. Our rollout runs while not session.done, so different DDP ranks make different numbers of generate() calls per episode: a short run (few turns) vs. the per-level max_steps cap (64 on early BabyAI stages, up to 128 after our registry cap for Synth and BossLevel). NCCL collectives are sequence-numbered: different call counts per rank = permanent desync.
Symptoms: training tqdm stuck, GPU 0–N pinned, vLLM GPU idle, NCCL watchdog firing after its timeout, UnpicklingError as ranks deserialize off-by-one collective buffers.
Fix: fixed-count padding: every rank performs exactly DIST_SERVER_GENERATES_PER_EPISODE generates per episode, where the count is max_episode_turns + (1 if memory_enabled else 0). After the real loop terminates, _pad_vllm_server_generates_to_target issues dummy 1-token generates under _temporary_vllm_max_tokens(trainer, 1), outputs discarded, guarded with try/finally. Active only when vllm_mode == "server" and world_size > 1; reward, logprobs, and credit assignment are byte-identical to the unpadded case.
This pattern is general. Any TRL rollout_func user running variable-length rollouts in server mode has this bug latent. LotteryElicitationEnv/PT (sibling project) hit it first; the same fix ported cleanly here.
MGPT_VLLM_GPU_UTIL is passed to TRL → vLLM as --vllm_gpu_memory_utilization. vLLM interprets it as the fraction of total device VRAM the engine may reserve (weights + KV budget). Not "fraction of what PyTorch left free."
In colocate mode, the policy model loads first, then vLLM tries to grab its share on the same GPU. Too high → vLLM startup ValueError or later torch.OutOfMemoryError on logprob / lm_head. The shipped TRL default of 0.9 is too aggressive on 40 GB A100 colocate. Safe range: 0.45–0.65.
Server mode needs ≥2 GPUs (splits vLLM vs. training devices); MGPT_VLLM_MODE=auto picks server on ≥2 GPUs else colocate.
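The auto-selection rule above, as a sketch (hypothetical helper name; the shipped logic reads MGPT_VLLM_MODE and the visible GPU count):

```python
def pick_vllm_mode(num_gpus: int, requested: str = "auto") -> str:
    """MGPT_VLLM_MODE=auto semantics: server mode needs >= 2 GPUs
    (separate vLLM vs. training devices), otherwise colocate."""
    if requested != "auto":
        return requested
    return "server" if num_gpus >= 2 else "colocate"
```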
| Issue | Root cause | Fix |
|---|---|---|
| Single-server curriculum blocked | Scaffold reset() did del kwargs before forwarding to the gym env, dropping the level kwarg | level = kwargs.pop("level", None) (and level_name) before clearing; now one Docker container serves every stage |
| Branch-stable memory races | When per_device_train_batch_size > num_generations, multiple prompt groups in one step map to the same branch index k and race on the file | Asserted at startup; a one-time UserWarning if the invariant is ever broken; recommended configuration batch == num_generations |
| Action-parser drift | PT package ships its own parse_action (so it doesn't import the env); env-side changes can silently diverge | Parity test tests/test_action_parser_parity.py in MiniGridPT cross-compares canonical actions + aliases against the env's parser |
| "go forward" fallback on unparseable text | Early-training LLMs emit malformed text; mapping to done kills episodes instantly (zero signal) | Fallback = go forward, not done; every invalid parse increments invalid_actions so the parse-rate climb is a visible training-progress curve |
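The first row's fix can be sketched as follows; `forward_reset_kwargs` is a hypothetical helper (the real change lives inside the wrapper's reset()):

```python
def forward_reset_kwargs(gym_reset, **kwargs):
    """Pop level/level_name for the wrapper's own curriculum switch
    instead of `del kwargs`, which silently dropped them before the
    remaining kwargs reached the gym env."""
    level = kwargs.pop("level", None)
    level_name = kwargs.pop("level_name", None)
    chosen = level or level_name
    # ...the wrapper would switch the underlying gym env to `chosen` here...
    return chosen, gym_reset(**kwargs)
```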
quadrantChart
title Grounded navigation: memory x OpenEnv/RL
x-axis "Stateless" --> "Memory-augmented"
y-axis "Gym only" --> "OpenEnv + RL stack"
quadrant-1 "Our target"
quadrant-2 "Untouched"
quadrant-3 "Classical RL"
quadrant-4 "Prompt-only"
"BabyAI / MiniGrid": [0.10, 0.14]
"Lottery (sibling env)": [0.22, 0.90]
"GRPO, no memory": [0.42, 0.62]
"Voyager": [0.90, 0.34]
"Reflexion": [0.74, 0.24]
"GenAgents": [0.88, 0.14]
"MiniGridEnv + MiniGridPT": [0.90, 0.88]
Figure 2. MiniGridEnv + MiniGridPT occupy the memory-augmented + OpenEnv + post-training quadrant that prior work leaves untouched. Voyager / Reflexion / Generative Agents are memory-rich but prompt-only; BabyAI itself is a gym env without an OpenEnv or RL-post-training story; sibling LotteryElicitationEnv is OpenEnv + RL but stateless.
GRPO is good enough to ship this submission: critic-free, works out of the box in TRL, scalar per-episode advantage is fine for short-horizon BabyAI stages. But it underweights step-level credit assignment, which is exactly what hurts on 30+ turn episodes and what memory mode needs (memory episodes are ~2× longer).
GiGPO = GRPO + anchor-state step-level advantages. Episode-level macro advantage (same group-relative signal as GRPO over $G$ completions):
$$A^{E}_i \;=\; \frac{R_i - \mu_R}{\sigma_R}.$$

Step-level micro advantage within anchor-state group $S_k$ (all $(\tau,t')$ pairs whose observation text hashes match step $t$):

$$A^{S}(a_t) \;=\; \frac{Q(a_t) - \mu_{Q(S_k)}}{\sigma_{Q(S_k)}}\,,\quad S_k = \big\{ (\tau, t') : \mathrm{hash}(o_{t'}) = \mathrm{hash}(o_t) \big\}.$$

Combined per-token advantage with mixing weight $\omega \ge 0$:

$$A_t \;=\; A^{E}_i + \omega\, A^{S}(a_t).$$

When no anchors are found, $A^{S} = 0$ and GiGPO reduces to GRPO (equivalently $\omega = 0$).
Why this fits MiniGrid: all G rollouts share the same initial observation for a given prompt/seed (guaranteed anchor); corridor navigation revisits the same 7×7 egocentric view; BabyAI per-seed determinism creates exact hash matches. The full step-level design is deferred to the GiGPO follow-up (trainer subclass + rollout fields).
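A sketch of the anchor-state grouping under the equations above. The step dict fields (obs, q, macro_adv) are illustrative; the follow-up trainer would compute Q from discounted returns and mix per-token:

```python
from collections import defaultdict
import numpy as np

def anchor_state_advantages(steps, omega=1.0):
    """GiGPO micro-advantage sketch: group (trajectory, step) pairs by a
    hash of the observation text, normalize per-step Q within each anchor
    group, and mix with the episode-level macro advantage."""
    groups = defaultdict(list)
    for i, s in enumerate(steps):
        groups[hash(s["obs"])].append(i)
    out = []
    for s in steps:
        idxs = groups[hash(s["obs"])]
        if len(idxs) < 2:
            micro = 0.0          # no anchor peers: reduces to plain GRPO
        else:
            qs = np.array([steps[j]["q"] for j in idxs])
            micro = (s["q"] - qs.mean()) / (qs.std() + 1e-8)
        out.append(s["macro_adv"] + omega * micro)
    return out
```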
| Config | Algorithm | Memory | Flags |
|---|---|---|---|
| A | GRPO | Off | --loss_type dapo |
| B | GRPO | On (branch-stable) | --loss_type dapo --memory --memory-branch-stable |
| C | GiGPO | Off | --use_gigpo |
| D | GiGPO | On (branch-stable) | --use_gigpo --memory --memory-branch-stable |
Hypothesis: D dominates. Step-level anchor-state credit and cross-episodic strategy accumulation are complementary: GiGPO assigns credit within an episode; memory propagates credit across episodes.
| Foundation | Role in this project | Citation |
|---|---|---|
| MiniGrid & BabyAI | Base gym environment, 10-stage curriculum, reference BotAgent upper bound, procedural level generation | Chevalier-Boisvert et al., arXiv:1810.08272 (ICLR 2019); Farama-Foundation/Minigrid |
| GRPO / DeepSeekMath | Critic-free group-relative policy optimization; our default trainer via TRL's GRPOTrainer | Shao et al., arXiv:2402.03300 |
| TRL × OpenEnv | `rollout_func` contract, vLLM colocate/server, `loss_type=dapo` length-bias handling | TRL OpenEnv docs |
| OpenEnv | Standard WebSocket env contract, per-session state, `create_app`, HF Space deploy | HF Blog: Introducing OpenEnv |
| Voyager | Skill-library / cross-episode knowledge accumulation (closest memory-system analog; ours is RL-trained where Voyager is prompt-engineered) | Wang et al., arXiv:2305.16291 |
| Reflexion | Verbal reflection after episodes; motivates a post-episode LLM rewrite pass over a persistent buffer | Shinn et al., arXiv:2303.11366 |
| Generative Agents | Long-term memory stream with relevance / recency weighting; our line-budgeted rewrite is a deliberately simpler alternative | Park et al., arXiv:2304.03442 |
| LotteryElicitationEnv / PT | Sibling OpenEnv submission; shared structural template for two-repo split, `rollout_func`, NCCL generate-count padding | Same monorepo · LotteryElicitationEnv HF Space |
| ReasoningEconomicsEnv / PT | Structural template for the `_temporary_vllm_max_tokens` pattern | Same monorepo |
Single-A100 Lambda recipe (use MiniGridEnv Docker + MiniGridPT scripts/ as the source of truth for env vars and launch order):
```bash
# 0. Clone both packages (sibling directories)
git clone https://github.com/sharma-yash01/MiniGridEnv.git
git clone https://github.com/sharma-yash01/MiniGridPT.git

# 1. Build + start MiniGridEnv (Docker on port 8000)
cd MiniGridEnv
sudo docker build -t minigrid-env:latest -f server/Dockerfile .
sudo docker run -d --name minigrid-env -p 8000:8000 \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  minigrid-env:latest
curl -sS "http://127.0.0.1:8000/health"

# 2. Configure MGPT_* (single A100, colocate vLLM)
export ENV_BASE_URL="http://127.0.0.1:8000"
export MGPT_ROOT=$(pwd)/../MiniGridPT
export MGPT_VENV=$HOME/.venvs/minigridpt-lambda
export PYTORCH_WHEEL_INDEX=https://download.pytorch.org/whl/cu121
export MGPT_MODEL=Qwen/Qwen3-8B
export MGPT_LEVEL=GoToRedBall
export MGPT_VLLM_MODE=colocate
export MGPT_VLLM_GPU_UTIL=0.45  # colocate-safe on A100 40GB

# 3. Bootstrap + preflight + train
bash "$MGPT_ROOT/scripts/bootstrap_lambda.sh"
source "$MGPT_VENV/bin/activate"
bash "$MGPT_ROOT/scripts/preflight_lambda.sh"
cd "$MGPT_ROOT" && nohup bash scripts/run_grpo_lambda.sh > train.log 2>&1 &
tail -f train.log

# 4. Memory-mode variant (branch-stable, batch == num_generations)
export MGPT_MEMORY=1
export MGPT_MEMORY_MAX_LINES=100
export MGPT_MEMORY_BRANCH_STABLE=1
export MGPT_NUM_GENERATIONS=8
export MGPT_BATCH_SIZE=8
bash "$MGPT_ROOT/scripts/run_grpo_lambda.sh"

# 5. Full curriculum (GoToRedBall -> BossLevel)
export ENV_URL="${ENV_BASE_URL}"
export MODEL="${MGPT_MODEL}"
export BASE_OUT="${MGPT_OUTPUT_DIR}/curriculum"
export USE_MEMORY=1
bash "$MGPT_ROOT/scripts/launch_curriculum.sh"
```
All 36 env-side tests pass with `cd MiniGridEnv && uv run --with pytest pytest tests`. The OpenEnv contract is validated with `openenv validate`.
What remains:

- **GiGPO trainer.** A `GiGPOTrainer(GRPOTrainer)` subclass. Minimum diff: add `obs_texts` / `step_boundaries` to the rollout return, compute anchor-state groups, and expand step advantages to tokens.
- **Eval parity.** `inference/run_episode.py` reads memory during play but does not yet mirror training's post-episode LLM memory rewrite. Evaluation should match training end-to-end; add a post-episode-memory-rewrite eval variant when more compute is available.
- **Baselines.** BabyAI's BotAgent and zero-shot LLM baselines, run over enough seeds to report completion rates and calibration against GRPO / GRPO+memory (deferred for lack of compute).

MiniGridEnv + MiniGridPT takes the gym-native MiniGrid/BabyAI curriculum and turns it into a complete OpenEnv + GRPO + memory pipeline. The environment is a faithful wrap: text observation, NL action, BabyAI's ten stages. The training package is the extension: branch-stable markdown memory, a post-episode LLM rewrite shaped by `_temporary_vllm_max_tokens`, and an env-mask-aware rollout loop that makes variable-length multi-turn episodes play nicely with vLLM server mode.
The infrastructure contributions (NCCL generate-count padding for variable-length rollouts, branch-stable per-chain memory files, the `max_completion_length` context manager for mixed action/memory generation budgets, per-reset curriculum via `reset()` kwargs) are lessons the next OpenEnv + TRL 1.0 + multi-turn + memory submission will need.
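The mixed-budget pattern mentioned above can be sketched as a small context manager. This is an illustrative reconstruction under assumed names (the attribute `max_completion_length` matches TRL's trainer config; the function name and trainer argument are hypothetical), not the shipped implementation:

```python
from contextlib import contextmanager

@contextmanager
def temporary_max_tokens(trainer, n_tokens):
    """Temporarily raise the completion budget for the long post-episode
    memory rewrite, then restore the short per-action budget, even if the
    generation call raises."""
    old = trainer.max_completion_length
    trainer.max_completion_length = n_tokens
    try:
        yield trainer
    finally:
        trainer.max_completion_length = old
```

The `try/finally` is the load-bearing part: a failed memory-rewrite generation must not leave the trainer stuck with the large budget for subsequent action steps.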
Empirical completion tables and memory ablations await the next compute cycle (Section 7 for the open question; Section 11 for the planned experiment matrix). What ships with this post is the validated pipeline and the formal semantics for $M$.