An OpenEnv-native wrap of Farama MiniGrid/BabyAI for text-grounded navigation, extended with cross-episodic, LLM-rewritten markdown memory and branch-stable GRPO.
*Banner (Figure 0): Act — Thought: / Action: parsed to Discrete(7), stepped over the OpenEnv WebSocket; Remember — line-limited markdown $M$, rewritten by the same LLM after each episode (Section 7).*

Most LLM benchmarks ask what a model can say. Few ask whether it can act in a grounded, compositional world while curating its own persistent notebook. MiniGridEnv is an OpenEnv-native wrap of Farama's MiniGrid / BabyAI that gives an LLM a 7×7 egocentric world rendered as natural language, natural-language actions ("go forward", "pickup", "turn left"), and BabyAI's ten-stage compositional instruction curriculum from GoToRedBall to BossLevel.
This blog is about the extension. The base environment is a faithful OpenEnv wrap of MiniGrid/BabyAI (existing work, now interoperable). The novel contribution is cross-episodic memory: a line-limited markdown file the agent reads before each action and rewrites at the end of each episode, plus branch-stable GRPO file naming so each parallel rollout chain keeps one stable file to compact across optimizer steps.
Every reward signal is ground-truth arithmetic from the underlying BabyAI bot-verifiable success criterion. There is no LLM judge in the loop.
The falsifiable claims:
Figure 0 (the banner above) encodes the contribution at a glance: the Observe panel matches the text-observation stack in Environment design; the Act panel matches NL actions, parsing, and OpenEnv stepping in the same section; the Remember panel matches cross-episodic memory $M$ in Cross-episodic memory and the training loop in Architecture & training pipeline.
Grounded navigation with compositional language is a load-bearing capability for embodied agents, web agents, and any LLM that must act under an observation budget. BabyAI has been the reference curriculum for this since 2019, but its native interface is a raw gym environment, not a WebSocket contract a GRPO trainer can consume across machines, Docker containers, and HF Spaces with a single code path.
The methodology is transferable. Any text-grounded sequential task with a sparse terminal reward and compositional instructions (web navigation, tool-use, interactive debugging, embodied robotics simulators) fits the same MDP template. Memory is also transferable: line-limited LLM-rewritten markdown is a general mechanism for self-directed state that is not specific to BabyAI.
The environment is engineering-cheap to scale. MiniGrid steps are microseconds; an instance is 1–5 MB; the OpenEnv wrapper sets max_concurrent_envs=256 out of the box. An LLM-backed environment cannot match that density.
The space of "LLMs + text-grounded navigation + memory" sits across three prior buckets. None occupies the cell we target:
| Prior work bucket | What it does | What it does not |
|---|---|---|
| BabyAI / MiniGrid (base) Chevalier-Boisvert et al., arXiv:1810.08272 (ICLR 2019); Farama-Foundation/Minigrid | Compositional language-conditioned navigation as a gym environment with a reference bot and a 10-stage difficulty curriculum | No OpenEnv/WebSocket contract; no text observation; no LLM post-training pipeline; no memory |
| Memory-augmented LLM agents Voyager (arXiv:2305.16291); Reflexion (arXiv:2303.11366); Generative Agents (arXiv:2304.03442) | Cross-episode skill libraries, verbal reflection, structured long-term memory, all prompt-engineered at inference time | No RL post-training; no branch-stable memory semantics under GRPO; not connected to OpenEnv |
| RLVR on language environments DeepSeekMath / GRPO (arXiv:2402.03300); TRL × OpenEnv (TRL docs) | Critic-free RL with verifiable rewards; standard WebSocket env contract and `rollout_func` | No persistent agent state across episodes; no first-class notion of branch-stable rollout chains |
| MiniGridEnv + MiniGridPT (ours) | OpenEnv wrap of MiniGrid/BabyAI + GRPO + cross-episodic LLM-rewritten markdown memory + branch-stable per-chain file naming | Not a human study; memory is text-only (no retrieval index) |
To our knowledge, no prior work combines an OpenEnv-native BabyAI environment with GRPO post-training, line-limited LLM-rewritten cross-episodic memory, and branch-stable memory-file naming that keeps each parallel GRPO chain anchored to a stable file across optimizer steps. The env-contract, memory semantics, and training package are the contribution; MiniGrid/BabyAI are the shoulders we stand on.
Two strictly separated packages. MiniGridEnv (the OpenEnv-compatible environment) and MiniGridPT (the GRPO training client) communicate exclusively over WebSocket. No shared Python imports. The training container is pure-GPU; the environment container is CPU-only.
Each episode:
1. `reset` selects a curriculum level (GoToRedBall … BossLevel), seeds procedural generation, and emits a mission like "go to the red ball" or "open the door on your left, then put the green ball next to the yellow key".
2. The agent replies with Thought: …\nAction: <one of 7 actions>.
3. The action is parsed into the Discrete(7) space; the gym env steps; the wrapper builds the next text observation.
4. Completing the mission terminates the episode with +1.0 (binary reward, the GRPO-friendly default).
5. The LLM rewrites the memory/*.md file for the next episode.

The agent's interface is deliberately minimal: plain Thought:/Action: text, no tool-call protocol, no JSON schema. The training client parses and steps the environment over WebSocket.
The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (MiniGridEnv/env/models.py):
# Action (agent -> env)
class MiniGridAction(Action):
    command: str                    # "go forward", "turn left", "pickup", ...
    thought: Optional[str] = None   # logged for analysis, not executed

# Observation (env -> agent)
class MiniGridObservation(Observation):
    text: str                       # NL description of the 7x7 egocentric view
    mission: str                    # "go to the red ball", ...
    step_idx: int; steps_remaining: int; max_steps: int
    history: list[dict]             # recent step summaries
    level_name: str
    last_action: Optional[str]
    action_success: Optional[bool]
    done: bool; reward: Optional[float]; metadata: dict

# State (hidden from agent; logging / eval only)
class MiniGridState(State):
    level_name, level_difficulty, completed, truncated,
    total_reward, steps_taken, optimal_steps, efficiency_ratio,
    valid_actions, invalid_actions, action_distribution
MiniGrid's raw observation is a (7, 7, 3) numpy grid of (object type, color, door state) with the agent fixed at row=6 col=3 facing "up". env/grid_to_text.py turns that into a layered NL description:
The description opens with a mission line ("Mission: …") and an orientation line ("You are facing {east,south,west,north}."), followed by per-object distance and direction phrases. The internal design note is blunt: "the quality of the text observation is the single biggest lever on training success." Everything else in the environment is a thin layer over the gym loop.
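A minimal sketch of the layered rendering. The object/color tables and phrasing here are illustrative assumptions; the shipped env/grid_to_text.py uses MiniGrid's own OBJECT_TO_IDX / COLOR_TO_IDX constants and is richer.

```python
import numpy as np

# Illustrative lookup tables (assumptions, not the shipped constants).
OBJECTS = {4: "door", 5: "key", 6: "ball", 7: "box"}
COLORS = {0: "red", 1: "green", 2: "blue", 3: "purple", 4: "yellow", 5: "grey"}

def grid_to_text(grid: np.ndarray, mission: str, facing: str) -> str:
    """Render a (7, 7, 3) egocentric grid as layered natural language.
    The agent sits at row=6, col=3 facing 'up' in grid coordinates."""
    lines = [f"Mission: {mission}", f"You are facing {facing}."]
    for row in range(grid.shape[0]):
        for col in range(grid.shape[1]):
            name = OBJECTS.get(int(grid[row, col, 0]))
            if name is None:
                continue  # empty cell / wall / floor: skipped for brevity
            steps_ahead = 6 - row   # cells in front of the agent
            lateral = col - 3       # negative = left, positive = right
            if lateral == 0:
                where = "directly ahead"
            else:
                side = "left" if lateral < 0 else "right"
                where = f"{abs(lateral)} to the {side}"
            color = COLORS.get(int(grid[row, col, 1]), "unknown")
            lines.append(f"A {color} {name}: {steps_ahead} step(s) forward, {where}.")
    return "\n".join(lines)
```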
env/action_parser.py maps natural-language strings to MiniGrid's discrete action index. The same logic is duplicated (intentionally) in MiniGridPT/training/openenv_runtime.py so the PT package remains standalone; a parity test guards the two copies.
| Canonical | Index | Accepted aliases |
|---|---|---|
| turn left | 0 | left |
| turn right | 1 | right |
| go forward | 2 | move forward, forward, ahead, step, walk |
| pickup | 3 | pick up, grab, take, get |
| drop | 4 | release, put down |
| toggle | 5 | open, close, unlock, switch |
| done | 6 | wait, noop, stop |
An unparseable string falls back to go forward, not to done. Rationale: early in training, exploration beats noop; every invalid parse increments a counter so we can watch parse-rate climb with training.
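A condensed sketch of the alias table and fallback logic above. This is not the shipped env/action_parser.py (which also handles prefixes and punctuation); the structure is an assumption based on the table.

```python
# Condensed sketch; the shipped parser also strips punctuation and
# handles "Action:" prefixes.
CANONICAL = {
    "turn left": 0, "turn right": 1, "go forward": 2,
    "pickup": 3, "drop": 4, "toggle": 5, "done": 6,
}
ALIASES = {
    "left": "turn left", "right": "turn right",
    "move forward": "go forward", "forward": "go forward",
    "ahead": "go forward", "step": "go forward", "walk": "go forward",
    "pick up": "pickup", "grab": "pickup", "take": "pickup", "get": "pickup",
    "release": "drop", "put down": "drop",
    "open": "toggle", "close": "toggle", "unlock": "toggle", "switch": "toggle",
    "wait": "done", "noop": "done", "stop": "done",
}

def parse_action(text: str) -> tuple[int, bool]:
    """Map a natural-language action to a Discrete(7) index.
    Returns (index, valid). Unparseable text falls back to "go forward"
    (index 2) so early-training noise explores instead of no-oping."""
    cmd = text.strip().lower()
    cmd = ALIASES.get(cmd, cmd)
    if cmd in CANONICAL:
        return CANONICAL[cmd], True
    return CANONICAL["go forward"], False  # fallback: explore, don't stop
```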
env/levels.py registers the full BabyAI ladder with candidate gym IDs (so minigrid version drift between BabyAI-GoToRedBallGrey-v0 and BabyAI-GoToRedBall-v0 doesn't brick a run):
| Stage | Level | Gym ID | Max steps | Optimal |
|---|---|---|---|---|
| 0 | GoToRedBall | BabyAI-GoToRedBallGrey-v0 | 64 | ~10 |
| 1 | GoToObj | BabyAI-GoToObj-v0 | 64 | ~12 |
| 1 | GoToLocal | BabyAI-GoToLocal-v0 | 64 | ~15 |
| 2 | PickupLoc | BabyAI-PickupLoc-v0 | 64 | ~14 |
| 2 | OpenDoor | BabyAI-OpenDoor-v0 | 64 | ~12 |
| 2 | UnlockLocal | BabyAI-UnlockLocal-v0 | 128 | ~25 |
| 3 | GoTo | BabyAI-GoTo-v0 | 128 | ~30 |
| 3 | PutNextLocal | BabyAI-PutNextLocal-v0 | 128 | ~20 |
| 4 | Synth | BabyAI-Synth-v0 | 128 | ~40 |
| 4 | BossLevel | BabyAI-BossLevel-v0 | 128 | ~80 |
A single Docker container serves every stage. env.reset(level="BossLevel") switches the underlying gym env per-reset. A fix replaced the original del kwargs in reset() with a kwargs.pop("level", None), which is what unlocked single-server curriculum training. Per-level max_steps are defined in our LevelConfig registry (env/levels.py); Synth and BossLevel are capped at 128 steps in this repo so episode length (and vLLM server-mode padding budgets) stay bounded for training.
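The candidate-ID fallback can be sketched as a small helper. `make_first_available` and the injected `make_fn` are illustrative names (in practice `make_fn` would be gymnasium.make; env/levels.py wraps this idea in its LevelConfig registry):

```python
def make_first_available(candidate_ids, make_fn, **kwargs):
    """Try candidate gym IDs in order so minigrid version drift (e.g.
    BabyAI-GoToRedBallGrey-v0 vs BabyAI-GoToRedBall-v0) doesn't brick
    a run. make_fn is injected so the sketch stays dependency-free."""
    last_err = None
    for env_id in candidate_ids:
        try:
            return make_fn(env_id, **kwargs)
        except Exception as err:  # unregistered / renamed gym ID
            last_err = err
    raise RuntimeError(f"no candidate gym ID available: {candidate_ids}") from last_err
```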
Default: binary. +1.0 on completion, 0.0 otherwise. GRPO works best with clean sparse signals. RewardConfig also supports shaped (step penalty + invalid-action penalty) and efficiency (bonus scaled to optimal_steps/steps_taken) modes if a stage stalls.
Let $r_t$ denote the per-step environment reward (binary default). With horizon $T$ (our capped max_steps), mission success at termination gives a single $+1$ spike:

$$R_{\mathrm{env}} \;=\; \sum_{t=1}^{T} r_t \;\in\; \{0, 1\}.$$
In the default mode, $r_t = 0$ for all $t < T$ unless the mission completes early; shaping modes spread signal across steps via RewardConfig in env/reward.py.
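The three modes can be sketched as one dispatch. Coefficient names and values below are illustrative assumptions, not the shipped constants in env/reward.py:

```python
from dataclasses import dataclass

@dataclass
class RewardSketch:
    """Illustrative stand-in for RewardConfig's three modes; the
    coefficient names/values here are assumptions."""
    mode: str = "binary"
    step_penalty: float = 0.01
    invalid_penalty: float = 0.05

    def terminal_reward(self, completed: bool, steps_taken: int,
                        invalid_actions: int, optimal_steps: int) -> float:
        base = 1.0 if completed else 0.0
        if self.mode == "binary":
            return base
        if self.mode == "shaped":
            return (base - self.step_penalty * steps_taken
                         - self.invalid_penalty * invalid_actions)
        if self.mode == "efficiency":
            # bonus scales with closeness to the bot-optimal path length
            return base * (optimal_steps / max(steps_taken, 1)) if completed else 0.0
        raise ValueError(self.mode)
```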
OpenEnv gives us three things that matter for this submission:
- A standard contract: `rollout_func` with typed Pydantic payloads and Gym-style reset/step semantics.
- Concurrency: SUPPORTS_CONCURRENT_SESSIONS=True and max_concurrent_envs=256. DDP ranks can hammer the same Space without cross-talk because each WebSocket session gets a fresh gym.Env instance (MiniGrid is not thread-safe; factory mode is mandatory).
- Deployment: the same code runs locally, in Docker (server/Dockerfile, openenv-base, port 8000), and as a Hugging Face Space during training and evaluation.

No new abstractions were invented. Base types only: EnvClient, Environment, Pydantic Action / Observation / State. Curriculum level, history, and per-episode metrics ride on metadata and state. The environment ships with openenv.yaml, a Dockerfile, and an HF Space.
Critically, MiniGridPT does not import MiniGridEnv. Everything crosses the wire. A MiniGridClient(EnvClient) in MiniGridPT/training/openenv_runtime.py sends plain dicts. This is the architectural lynchpin that lets the training node be pure-GPU and the environment node be CPU-only.
This is the research contribution. The base MiniGrid/BabyAI world is stateless between episodes: each reset gives the agent a fresh procedurally generated room with no persistent side-channel. We add one:
@dataclass
class MemoryConfig:
    enabled: bool = False
    max_lines: int = 100          # line-limit, not token-limit
    memory_dir: str = "./memory"
    agent_id: str = "default"
    branch_stable_memory: bool = False   # see below

    @property
    def memory_path(self) -> Path:
        return Path(self.memory_dir) / f"{self.agent_id}.md"
Among the deliberate design choices, each rejecting a plausible alternative:

- A line budget, not a token budget, surfaced to the model (e.g. (42/100 lines)). The model gets a concrete budget it can reason about.
- Overflow truncation: if a rewrite exceeds max_lines, keep the most-recently-written lines.

The _temporary_vllm_max_tokens helper. Action turns need ~128 tokens (Thought: …\nAction: go forward). The memory rewrite needs ~512 (100 lines at ~5 tokens/line worst case). One global max_completion_length cannot satisfy both. The fix is a context manager:
@contextmanager
def _temporary_vllm_max_tokens(trainer, max_tokens: int):
    vg = trainer.vllm_generation
    prev = vg.max_completion_length
    vg.max_completion_length = max_tokens
    try:
        yield
    finally:
        vg.max_completion_length = prev

# Used both for the 512-token memory rewrite and for the 1-token
# NCCL-padding dummy generates described in the Engineering section.
GRPO runs G parallel completions per prompt, each with its own advantage and gradient contribution. If every slot writes to a uniquely-named file, there's no continuity across optimizer steps, so each memory chain is one episode long. If every slot writes to one shared file, writes race and the signal is mush.
The solution: branch-stable naming rank{R}_br{k}_{base}.md with k = slot_idx % num_generations. The k-th parallel generation maps to a stable file across optimizer steps, so branch k after prompt group P1 is the same file used by branch k after prompt group P2. Each of the G GRPO branches builds its own evolving notebook, which is what gives the model a training signal to compact and summarize episode-to-episode.
Requires per_device_train_batch_size == num_generations (otherwise multiple groups in one step hit the same k and a one-time UserWarning fires). A third scheme (a single shared file across all slots and ranks) is sketched but not landed; it needs a decision about concurrent-writer races.
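The naming scheme itself is a one-liner; this sketch mirrors the rank{R}_br{k}_{base}.md convention described above (the helper name is illustrative):

```python
from pathlib import Path

def branch_stable_path(memory_dir: str, rank: int, slot_idx: int,
                       num_generations: int, base: str = "default") -> Path:
    """Map GRPO slot s to branch k = s % G so the k-th parallel
    generation reuses one stable memory file across optimizer steps."""
    k = slot_idx % num_generations
    return Path(memory_dir) / f"rank{rank}_br{k}_{base}.md"
```

With G = 4, slot 1 in one optimizer step and slot 5 in the next resolve to the same file, which is exactly the cross-step continuity the scheme is after.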
Let $M_e \in \mathcal{M}$ denote the memory file (markdown string) at the start of episode $e$, let $\tau_e$ be the trajectory (observations, parsed actions, outcomes), and let $\pi_\theta^{\mathrm{mem}}$ be the same LLM invoked on the post-episode memory-update prompt. The write is a full rewrite followed by a line-budget projection $\Pi_L(\cdot)$ that keeps the last $L$ lines (here $L = 100$):
$$M_{e+1} = \Pi_L\!\left( \pi_\theta^{\mathrm{mem}}(M_e,\, \tau_e,\, \mathrm{outcome}_e) \right).$$

Branch-stable filenames tie each GRPO branch index $k = s \bmod G$ to a stable path across optimizer steps, for DDP rank $R$, slot index $s$, group size $G = \texttt{num\_generations}$, and basename base (e.g. default):

$$\mathrm{path}(R, s) \;=\; \texttt{rank\{R\}\_br\{k\}\_\{base\}.md}, \qquad k = s \bmod G.$$
This is exactly the Remember panel in Figure 0: the file card is $M_e$ at read time; the post-episode LLM box is $\pi_\theta^{\mathrm{mem}}$; the curved arrow is the next-episode read of $M_{e+1}$.
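The write step above can be sketched in a few lines, with a stand-in `rewrite_fn` for the LLM call (a hypothetical signature; the shipped code fills MEMORY_UPDATE_PROMPT and generates under a 512-token budget):

```python
def project_to_line_budget(memory_text: str, max_lines: int = 100) -> str:
    """Pi_L: keep only the most-recently-written max_lines lines."""
    return "\n".join(memory_text.splitlines()[-max_lines:])

def update_memory(memory: str, trajectory_summary: str, rewrite_fn,
                  max_lines: int = 100) -> str:
    """M_{e+1} = Pi_L(pi_mem(M_e, tau_e, outcome_e)). rewrite_fn stands
    in for the LLM invoked on the post-episode memory-update prompt."""
    return project_to_line_budget(rewrite_fn(memory, trajectory_summary), max_lines)
```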
Can an LLM learn to curate its own persistent, line-budgeted notebook such that cross-episodic memory measurably improves completion rate, and the memory content evolves from random notes into structured strategies as training progresses?
The environment reward is terminal and sparse. Everything else is a small shaping bonus designed to rule out pathological regimes without dominating the signal.
| Component | Range | Source | What it rewards |
|---|---|---|---|
| Env reward (binary) | 0 or +1 | env/reward.py | Mission completed (BabyAI ground-truth success) |
| Format reward | [−0.1, +0.1] | reward_funcs.reward_format | Both Thought: and Action: present (1.0), one (0.5), neither (0.0), rescaled |
| Memory: in-budget | +0.05 | compute_memory_quality_flags | Memory rewrite stayed within max_lines (no truncation) |
| Memory: non-empty | +0.05 | compute_memory_quality_flags | Agent is actually writing something |
| Memory: not-a-dump | −0.05 | memory_looks_like_observation_dump | Penalty if memory is just a copy of the last observation |
Design principle: env reward dominates. Format and memory-quality bonuses are at ±0.1–0.15 scale, intended as training wheels, removable once the model reliably emits structured output (>90% validity) and writes substantive memory.
Let $\tau$ denote an episode trajectory and $M_e, M_{e+1}$ memory before/after the episode. Write $R_{\mathrm{env}} = \sum_t r_t \in \{0,1\}$ for the binary BabyAI success signal, $R_{\mathrm{fmt}}(\tau)$ for the rescaled format score in $[-1,1]$ (mapped to $[-0.1,0.1]$ via $\alpha_{\mathrm{fmt}} = 0.1$ in code), and $R_{\mathrm{mem}}(M_{e+1})$ for the memory-quality shaping used by the trainer. The scalar logged to TRL as env_reward is:

$$R(\tau, M_e, M_{e+1}) \;=\; R_{\mathrm{env}} \;+\; \alpha_{\mathrm{fmt}}\, R_{\mathrm{fmt}}(\tau) \;+\; R_{\mathrm{mem}}(M_{e+1}).$$
With indicator $\mathbf{1}[\cdot]$, line budget $L$, and dump detector $\mathrm{dump}(M)$ (true when memory is effectively a copy of the last observation):
$$R_{\mathrm{mem}}(M) = \beta_{\mathrm{budget}}\,\mathbf{1}\big[\mathrm{lines}(M) \le L\big] + \beta_{\mathrm{ne}}\,\mathbf{1}\big[M \neq \varnothing\big] - \beta_{\mathrm{dump}}\,\mathbf{1}\big[\mathrm{dump}(M)\big],$$

with $\beta_{\mathrm{budget}} = \beta_{\mathrm{ne}} = \beta_{\mathrm{dump}} = 0.05$ as implemented in MiniGridPT (names may differ slightly in code; the ranges in the table above match the shipped constants).
Why memory quality has a negative flag. Without the memory_looks_like_observation_dump penalty, the shortest-path way to collect the non-empty bonus is to paste the last observation into memory. That gives zero cross-episodic signal. The penalty forces the memory to be compressed / abstracted, which is the interesting behavior.
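An illustrative version of the three flags. The shipped compute_memory_quality_flags / memory_looks_like_observation_dump may differ in detail, notably in how "dump" overlap is measured; the 0.8 token-overlap threshold below is an assumption:

```python
def memory_quality_bonus(memory: str, last_observation: str,
                         max_lines: int = 100) -> float:
    """Sketch of the three shaping flags: in-budget, non-empty, not-a-dump."""
    bonus = 0.0
    if memory.strip():
        bonus += 0.05                    # non-empty: agent wrote something
    if len(memory.splitlines()) <= max_lines:
        bonus += 0.05                    # stayed inside the line budget
    # crude dump detector: heavy token overlap with the last observation
    mem_tokens = set(memory.lower().split())
    obs_tokens = set(last_observation.lower().split())
    if obs_tokens and len(mem_tokens & obs_tokens) / max(len(mem_tokens), 1) > 0.8:
        bonus -= 0.05                    # memory is a pasted observation
    return bonus
```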
Two strictly separated packages. MiniGridEnv (OpenEnv environment) and MiniGridPT (GRPO training client) communicate exclusively over WebSocket (no in-process imports).
flowchart LR
subgraph PT ["MiniGridPT (Training)"]
GRPO["GRPOTrainer
TRL 1.0.0"]
RF["rollout_func
(per-episode loop)"]
VLLM["vLLM
colocate/server"]
PARSE["parse_action
(NL -> Discrete(7))"]
MEM["memory/rank{R}_br{k}_default.md
(branch-stable)"]
end
subgraph ENV ["MiniGridEnv (OpenEnv)"]
WS["FastAPI
WebSocket"]
GYM["MiniGrid gym env
(BabyAI level)"]
TEXT["grid_to_text
(7x7 -> NL)"]
REW["Reward
binary +1.0"]
end
GRPO --> RF
RF --> VLLM
VLLM -->|"generate Thought/Action"| PARSE
PARSE -->|"{command, thought}"| WS
WS --> GYM
GYM --> TEXT
TEXT -->|"observation.text"| WS
WS -->|"obs + reward + done"| RF
RF -->|"post-episode rewrite"| MEM
MEM -->|"read at t=0 next episode"| RF
REW --> WS
Figure 1. System architecture. PT never imports env-side types. Memory is a per-branch markdown file owned by the training client, rewritten by the LLM at each episode end.
Training uses GRPO (Group Relative Policy Optimization), a critic-free RL algorithm ideal for terminal-only rewards. We use TRL 1.0.0's rollout_func contract for explicit control over the generate → parse → env.step loop.
Per slot, _rollout_one_episode (MiniGridPT/training/rollout_func.py) runs a complete episode inside one training step:
1. Connect: construct MiniGridClient(base_url=ENV_BASE_URL).sync() and call env.reset(level=LEVEL, seed=…).
2. Turn loop until done or the turn cap: generate with vLLM (max_completion_length=128), append tokens to completion_ids with env_mask=1, parse Thought/Action, call env.step({"command": canonical, "thought": thought}), append the rendered next-observation tokens with env_mask=0.
3. Memory rewrite: fill MEMORY_UPDATE_PROMPT with outcome / steps / current memory / line count / budget, call generate() wrapped in _temporary_vllm_max_tokens(trainer, 512), write to the branch-stable file, append tokens with env_mask=0.

The return dict is the shape TRL's GRPOTrainer consumes:
{
    "prompt_ids": list[list[int]],      # one per slot (fixed initial prompt)
    "completion_ids": list[list[int]],  # full episode (LLM + env user turns)
    "logprobs": list[list[float]],      # from vLLM; zero-filled for env_mask=0
    "env_mask": list[list[int]],        # 1 = LLM token, 0 = env/context token
    "env_reward": list[float],          # binary env reward + memory-quality bonus
}
The env_mask is what lets us mix LLM-authored tokens (eligible for the env-reward term in the GRPO objective) with env-rendered context tokens (visible for KL-to-reference but excluded from the advantage weighting). Without it, the model would be "rewarded" for tokens it didn't generate.
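A toy sketch of how the turn loop interleaves LLM-authored and env-rendered tokens into that shape. The token lists and the -0.5 placeholder logprob are illustrative, not real tokenizer or vLLM output:

```python
def interleave_turns(llm_turns: list[list[int]],
                     env_turns: list[list[int]]) -> dict:
    """Build the completion_ids / env_mask / logprobs triple: mask 1 for
    LLM-authored tokens, mask 0 (zero-filled logprobs) for env tokens."""
    completion_ids: list[int] = []
    env_mask: list[int] = []
    logprobs: list[float] = []
    for i, llm in enumerate(llm_turns):
        completion_ids += llm
        env_mask += [1] * len(llm)
        logprobs += [-0.5] * len(llm)    # placeholder for vLLM logprobs
        if i < len(env_turns):
            env = env_turns[i]
            completion_ids += env
            env_mask += [0] * len(env)
            logprobs += [0.0] * len(env)  # env tokens: zero-filled
    return {"completion_ids": completion_ids,
            "env_mask": env_mask, "logprobs": logprobs}
```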
At episode boundaries, the rollout reads memory $M_e$ into the prompt, rolls out $\tau$ with environment observations rendered as tokens with mask $0$, then applies $\pi_\theta^{\mathrm{mem}}$ to obtain $M_{e+1}$ as in Section 7, the same loop sketched in Figure 0 (Remember).
Let $y_{i,1:T_i}$ be the token sequence for completion $i$ (including env turns), $m_{i,t} \in \{0,1\}$ the env mask, and $\rho_{i,t}(\theta) = \pi_\theta(y_{i,t}\mid y_{i,1:t-1}) / \pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid y_{i,1:t-1})$ the importance ratio on LLM-authored tokens. With clipping threshold $\epsilon$, KL coefficient $\beta_{\mathrm{KL}}$, and group-relative advantage $A_i = (R_i - \mu_R)/\sigma_R$ over $G$ parallel completions sharing a prompt, the masked GRPO-style surrogate we target is:
$$\mathcal{L}_{\mathrm{GRPO}}(\theta) \;=\; -\,\mathbb{E}\left[ \sum_{i=1}^{G} \sum_{t : m_{i,t}=1} \min\!\Big( \rho_{i,t}(\theta)\, A_i,\; \mathrm{clip}\big(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon\big)\, A_i \Big) \right] + \beta_{\mathrm{KL}}\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}).$$

Here $R_i \equiv R(\tau_i, M_e, M_{e+1})$ is the scalar from Section 8 (environment return plus shaping). Tokens with $m_{i,t}=0$ contribute to the KL / context loss path in TRL but not to the clipped policy-gradient term above.
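A numpy sketch of the masked surrogate above (KL term omitted; TRL's actual implementation differs in detail, e.g. token normalization):

```python
import numpy as np

def masked_grpo_surrogate(logp_new, logp_old, env_mask, rewards, eps=0.2):
    """Clipped group-relative surrogate over LLM-authored tokens only.
    Shapes: logp_* and env_mask are (G, T); rewards is (G,)."""
    logp_new, logp_old = np.asarray(logp_new), np.asarray(logp_old)
    mask = np.asarray(env_mask, dtype=float)
    R = np.asarray(rewards, dtype=float)
    adv = (R - R.mean()) / (R.std() + 1e-8)            # group-relative advantage
    ratio = np.exp(logp_new - logp_old)                # per-token importance ratio
    unclipped = ratio * adv[:, None]
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv[:, None]
    per_token = np.minimum(unclipped, clipped) * mask  # env tokens drop out
    return -per_token.sum()                            # loss (to minimize)
```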
We do not include step-by-step episode traces here: the training logs for the current run are not immediately accessible, and the priority for this submission is the mechanism (Figure 0, Sections 7–9) rather than cherry-picked rollouts. Compute for this project is exhausted before we could finish a converged memory ablation; the author is also concurrently submitting LotteryElicitationEnv and ReasoningEconomicsEnv to the same OpenEnv track, so GPU budget is shared across multiple codebases.
The cards below are not verbatim snapshots from a finished training run. They are category placeholders for what we expect to extract once additional compute is available to run memory-structure experiments and to save real memory/rank{R}_br{k}_*.md files at checkpoints. Future work will test alternative memory organizations (Section 11) under that budget.
When a long run exists, we will snapshot memory/rank0_br0_default.md (or branch-stable peers) and categorize content. For now, each panel illustrates a type of content we expect to see at different training phases:
ball is red
i saw a door
step 3 turn left
step 4 go forward
- if the Mission says "go to X", first face X
- turn left/right before go forward if object is
to the side
- on GoToRedBall the ball is usually 1-3 steps away
- UnlockLocal: keys are the same color as doors
- OpenDoor: "toggle" opens closed and locked doors
(if carrying correct key)
- Synth: mission has multiple clauses -> do them
left-to-right as written
- if action_success=False on go forward, there is a
wall/door -> rotate before next step
- pickup with no adjacent object always fails; read
"carrying: nothing" before attempting
Verbatim memory snapshots from a converged or long partial run will replace these placeholders when further compute is available. Until then, the gallery documents the hypothesis space for how $M$ should evolve, not empirical outcomes from the current submission.
The MiniGridPT training package was exercised for correctness (short runs, parser parity, WebSocket stepping, memory file I/O, vLLM colocate and multi-GPU server mode with NCCL-safe padding). We do not report converged learning curves or final completion rates: the policy did not converge on the full curriculum under the available budget, and structured experiments on alternative memory formats are deferred to a follow-up compute cycle. Compute for this line of work is exhausted for the current submission window; the author is concurrently shipping LotteryElicitationEnv and ReasoningEconomicsEnv to the same OpenEnv track, so GPU time is shared across multiple submissions.
Validated in short runs:

- End-to-end rollout_func, with env_mask partitioning LLM-authored vs. env-rendered tokens and per-episode logs persisting to the --output_dir.
- vLLM colocate mode, with MGPT_VLLM_GPU_UTIL tuned to ~0.45–0.65 on 40 GB (see Engineering lessons).
- Memory rewrites under _temporary_vllm_max_tokens(trainer, 512); branch-stable filenames rank{R}_br{k}_default.md persisting across optimizer steps.
- Multi-GPU server mode with fixed-count generate padding (DIST_SERVER_GENERATES_PER_EPISODE) eliminating NCCL desync under variable-length episodes (now bounded by our capped max_steps ≤ 128 per level in env/levels.py).
- Lambda launch chain bootstrap_lambda.sh → preflight_lambda.sh → run_grpo_lambda.sh, with MGPT_* env vars and cadence / metrics callbacks writing metrics_scalars.csv, metrics_events.jsonl, cadence.log, diagnostics_cadence.jsonl.
- Action-parser parity preserving the Discrete(7) contract across both packages.

The environment bundles three baselines (Random, BabyAI BotAgent, and a caller-provided zero-shot completion_fn), all runnable in-process without a GPU. They are not executed at scale in this submission; longer baseline sweeps and GRPO comparisons are explicitly scoped for the next compute allocation.
Because the current run did not converge and memory-structure ablations are outstanding, the table below is the forward-looking experiment matrix for how $M$ might be organized once additional GPU budget is available. Each row states a hypothesis; all rows require additional compute.
| Variant | Hypothesis tested | Notes |
|---|---|---|
| Structured schema (JSON / YAML / fixed markdown sections) | Schema > free-form markdown for stable curation | Requires additional compute |
| Append + periodic compaction | Full-episode rewrite cost limits the learning signal | Requires additional compute |
| Hierarchical (in-episode scratchpad + cross-episode long-term) | Conflating short- and long-term in one file hurts | Requires additional compute |
| Retrieval-indexed (embed notes, top-k by observation) | Linear-file recall fails at scale | Requires additional compute |
| Shared single-file across branches / ranks | Collective memory beats per-branch curation | Shared-memory design TBD; requires additional compute + concurrency design |
| Success-gated writes | Failure episodes poison $M$ | Requires additional compute |
| Variable line budget by level difficulty | Uniform $L$ is too tight for hardest stages | Requires additional compute |
| Dual-memory (policy vs. world knowledge) | Unified $M$ conflates two knowledge types | Requires additional compute |
| Token budget instead of line budget | Line-count is the wrong self-budgeting unit for LLMs | Requires additional compute |
The research question in Section 7 remains the scientific target; this submission's empirical contribution is the validated pipeline and semantics for $M$, not yet a table of win-rates.
Running GRPO + OpenEnv + vLLM on a multi-turn, memory-augmented environment surfaced three categories of structural issues. We document the ones that are general; the next OpenEnv submission is likely to hit each.
In vllm_mode=server, each trainer.vllm_generation.generate() call performs gather_object → all_gather_object → broadcast_object_list. Our rollout runs while not session.done, so different DDP ranks make different numbers of generate() calls per episode: a short run (few turns) vs. the per-level max_steps cap (64 on early BabyAI stages, up to 128 after our registry cap for Synth and BossLevel). NCCL collectives are sequence-numbered: different call counts per rank = permanent desync.
Symptoms: training tqdm stuck, GPU 0–N pinned, vLLM GPU idle, NCCL watchdog firing after its timeout, UnpicklingError as ranks deserialize off-by-one collective buffers.
Fix: fixed-count padding: every rank performs exactly DIST_SERVER_GENERATES_PER_EPISODE generates per episode, where the count is max_episode_turns + (1 if memory_enabled else 0). After the real loop terminates, _pad_vllm_server_generates_to_target issues dummy 1-token generates under _temporary_vllm_max_tokens(trainer, 1), outputs discarded, guarded with try/finally. Active only when vllm_mode == "server" and world_size > 1; reward, logprobs, and credit assignment are byte-identical to the unpadded case.
This pattern is general. Any TRL rollout_func user running variable-length rollouts in server mode has this bug latent. LotteryElicitationEnv/PT (sibling project) hit it first; the same fix ported cleanly here.
MGPT_VLLM_GPU_UTIL is passed to TRL → vLLM as --vllm_gpu_memory_utilization. vLLM interprets it as the fraction of total device VRAM the engine may reserve (weights + KV budget). Not "fraction of what PyTorch left free."
In colocate mode, the policy model loads first, then vLLM tries to grab its share on the same GPU. Too high → vLLM startup ValueError or later torch.OutOfMemoryError on logprob / lm_head. The shipped TRL default of 0.9 is too aggressive on 40 GB A100 colocate. Safe range: 0.45–0.65.
Server mode needs ≥2 GPUs (splits vLLM vs. training devices); MGPT_VLLM_MODE=auto picks server on ≥2 GPUs else colocate.
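The auto-selection rule above, as a sketch (hypothetical helper name; the shipped logic reads MGPT_VLLM_MODE and the visible GPU count):

```python
def pick_vllm_mode(num_gpus: int, requested: str = "auto") -> str:
    """MGPT_VLLM_MODE=auto semantics: server mode needs >= 2 GPUs
    (separate vLLM vs. training devices), otherwise colocate."""
    if requested != "auto":
        return requested
    return "server" if num_gpus >= 2 else "colocate"
```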
| Issue | Root cause | Fix |
|---|---|---|
| Single-server curriculum blocked | Scaffold reset() did del kwargs before forwarding to the gym env, dropping the level kwarg | level = kwargs.pop("level", None) (and level_name) before clearing; now one Docker container serves every stage |
| Branch-stable memory races | When per_device_train_batch_size > num_generations, multiple prompt groups in one step map to the same branch index k and race on the file | Asserted at startup; a one-time UserWarning if the invariant is ever broken; recommended configuration batch == num_generations |
| Action-parser drift | PT package ships its own parse_action (so it doesn't import the env); env-side changes can silently diverge | Parity test tests/test_action_parser_parity.py in MiniGridPT cross-compares canonical actions + aliases against the env's parser |
| "go forward" fallback on unparseable text | Early-training LLMs emit malformed text; mapping to done kills episodes instantly (zero signal) | Fallback = go forward, not done; every invalid parse increments invalid_actions so the parse-rate climb is a visible training-progress curve |
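The first row's fix can be sketched as follows; `forward_reset_kwargs` is a hypothetical helper (the real change lives inside the wrapper's reset()):

```python
def forward_reset_kwargs(gym_reset, **kwargs):
    """Pop level/level_name for the wrapper's own curriculum switch
    instead of `del kwargs`, which silently dropped them before the
    remaining kwargs reached the gym env."""
    level = kwargs.pop("level", None)
    level_name = kwargs.pop("level_name", None)
    chosen = level or level_name
    # ...the wrapper would switch the underlying gym env to `chosen` here...
    return chosen, gym_reset(**kwargs)
```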
quadrantChart
title Grounded navigation: memory x OpenEnv/RL
x-axis "Stateless" --> "Memory-augmented"
y-axis "Gym only" --> "OpenEnv + RL stack"
quadrant-1 "Our target"
quadrant-2 "Untouched"
quadrant-3 "Classical RL"
quadrant-4 "Prompt-only"
"BabyAI / MiniGrid": [0.10, 0.14]
"Lottery (sibling env)": [0.22, 0.90]
"GRPO, no memory": [0.42, 0.62]
"Voyager": [0.90, 0.34]
"Reflexion": [0.74, 0.24]
"GenAgents": [0.88, 0.14]
"MiniGridEnv + MiniGridPT": [0.90, 0.88]
Figure 2. MiniGridEnv + MiniGridPT occupy the memory-augmented + OpenEnv + post-training quadrant that prior work leaves untouched. Voyager / Reflexion / Generative Agents are memory-rich but prompt-only; BabyAI itself is a gym env without an OpenEnv or RL-post-training story; sibling LotteryElicitationEnv is OpenEnv + RL but stateless.
GRPO is good enough to ship this submission: critic-free, works out of the box in TRL, scalar per-episode advantage is fine for short-horizon BabyAI stages. But it underweights step-level credit assignment, which is exactly what hurts on 30+ turn episodes and what memory mode needs (memory episodes are ~2× longer).
GiGPO = GRPO + anchor-state step-level advantages. Episode-level macro advantage (same group-relative signal as GRPO over $G$ completions):
$$A^{E}_i \;=\; \frac{R_i - \mu_R}{\sigma_R}.$$

Step-level micro advantage within anchor-state group $S_k$ (all $(\tau,t')$ pairs whose observation text hashes match step $t$):

$$A^{S}(a_t) \;=\; \frac{Q(a_t) - \mu_{Q(S_k)}}{\sigma_{Q(S_k)}}\,,\quad S_k = \big\{ (\tau, t') : \mathrm{hash}(o_{t'}) = \mathrm{hash}(o_t) \big\}.$$

Combined per-token advantage with mixing weight $\omega \ge 0$:

$$A_t \;=\; A^{E}_i + \omega\, A^{S}(a_t).$$

When no anchors are found, $A^{S} = 0$ and GiGPO reduces to GRPO (equivalently $\omega = 0$).
Why this fits MiniGrid: all G rollouts share the same initial observation for a given prompt/seed (guaranteed anchor); corridor navigation revisits the same 7×7 egocentric view; BabyAI per-seed determinism creates exact hash matches. The full step-level design is deferred to the GiGPO follow-up (trainer subclass + rollout fields).
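A sketch of the anchor-state grouping under the equations above. The step dict fields (obs, q, macro_adv) are illustrative; the follow-up trainer would compute Q from discounted returns and mix per-token:

```python
from collections import defaultdict
import numpy as np

def anchor_state_advantages(steps, omega=1.0):
    """GiGPO micro-advantage sketch: group (trajectory, step) pairs by a
    hash of the observation text, normalize per-step Q within each anchor
    group, and mix with the episode-level macro advantage."""
    groups = defaultdict(list)
    for i, s in enumerate(steps):
        groups[hash(s["obs"])].append(i)
    out = []
    for s in steps:
        idxs = groups[hash(s["obs"])]
        if len(idxs) < 2:
            micro = 0.0          # no anchor peers: reduces to plain GRPO
        else:
            qs = np.array([steps[j]["q"] for j in idxs])
            micro = (s["q"] - qs.mean()) / (qs.std() + 1e-8)
        out.append(s["macro_adv"] + omega * micro)
    return out
```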
| Config | Algorithm | Memory | Flags |
|---|---|---|---|
| A | GRPO | Off | --loss_type dapo |
| B | GRPO | On (branch-stable) | --loss_type dapo --memory --memory-branch-stable |
| C | GiGPO | Off | --use_gigpo |
| D | GiGPO | On (branch-stable) | --use_gigpo --memory --memory-branch-stable |
Hypothesis: D dominates. Step-level anchor-state credit and cross-episodic strategy accumulation are complementary: GiGPO assigns credit within an episode; memory propagates credit across episodes.
| Foundation | Role in this project | Citation |
|---|---|---|
| MiniGrid & BabyAI | Base gym environment, 10-stage curriculum, reference BotAgent upper bound, procedural level generation | Chevalier-Boisvert et al., arXiv:1810.08272 (ICLR 2019); Farama-Foundation/Minigrid |
| GRPO / DeepSeekMath | Critic-free group-relative policy optimization; our default trainer via TRL's GRPOTrainer | Shao et al., arXiv:2402.03300 |
| TRL × OpenEnv | `rollout_func` contract, vLLM colocate/server, `loss_type=dapo` length-bias handling | TRL OpenEnv docs |
| OpenEnv | Standard WebSocket env contract, per-session state, `create_app`, HF Space deploy | HF Blog: Introducing OpenEnv |
| Voyager | Skill-library / cross-episode knowledge accumulation (closest memory-system analog; ours is RL-trained where Voyager is prompt-engineered) | Wang et al., arXiv:2305.16291 |
| Reflexion | Verbal reflection after episodes; motivates a post-episode LLM rewrite pass over a persistent buffer | Shinn et al., arXiv:2303.11366 |
| Generative Agents | Long-term memory stream with relevance / recency weighting; our line-budgeted rewrite is a deliberately simpler alternative | Park et al., arXiv:2304.03442 |
| LotteryElicitationEnv / PT | Sibling OpenEnv submission; shared structural template for two-repo split, `rollout_func`, NCCL generate-count padding | Same monorepo · LotteryElicitationEnv HF Space |
| ReasoningEconomicsEnv / PT | Structural template for the `_temporary_vllm_max_tokens` pattern | Same monorepo |
Single-A100 Lambda recipe (use MiniGridEnv Docker + MiniGridPT scripts/ as the source of truth for env vars and launch order):
```bash
# 0. Clone both packages (sibling directories)
git clone https://github.com/sharma-yash01/MiniGridEnv.git
git clone https://github.com/sharma-yash01/MiniGridPT.git

# 1. Build + start MiniGridEnv (Docker on port 8000)
cd MiniGridEnv
sudo docker build -t minigrid-env:latest -f server/Dockerfile .
sudo docker run -d --name minigrid-env -p 8000:8000 \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  minigrid-env:latest
curl -sS "http://127.0.0.1:8000/health"

# 2. Configure MGPT_* (single A100, colocate vLLM)
export ENV_BASE_URL="http://127.0.0.1:8000"
export MGPT_ROOT=$(pwd)/../MiniGridPT
export MGPT_VENV=$HOME/.venvs/minigridpt-lambda
export PYTORCH_WHEEL_INDEX=https://download.pytorch.org/whl/cu121
export MGPT_MODEL=Qwen/Qwen3-8B
export MGPT_LEVEL=GoToRedBall
export MGPT_VLLM_MODE=colocate
export MGPT_VLLM_GPU_UTIL=0.45  # colocate-safe on A100 40GB

# 3. Bootstrap + preflight + train
bash "$MGPT_ROOT/scripts/bootstrap_lambda.sh"
source "$MGPT_VENV/bin/activate"
bash "$MGPT_ROOT/scripts/preflight_lambda.sh"
cd "$MGPT_ROOT" && nohup bash scripts/run_grpo_lambda.sh > train.log 2>&1 &
tail -f train.log

# 4. Memory-mode variant (branch-stable, batch == num_generations)
export MGPT_MEMORY=1
export MGPT_MEMORY_MAX_LINES=100
export MGPT_MEMORY_BRANCH_STABLE=1
export MGPT_NUM_GENERATIONS=8
export MGPT_BATCH_SIZE=8
bash "$MGPT_ROOT/scripts/run_grpo_lambda.sh"

# 5. Full curriculum (GoToRedBall -> BossLevel)
export ENV_URL="${ENV_BASE_URL}"
export MODEL="${MGPT_MODEL}"
export BASE_OUT="${MGPT_OUTPUT_DIR}/curriculum"
export USE_MEMORY=1
bash "$MGPT_ROOT/scripts/launch_curriculum.sh"
```
All 36 env-side tests pass with `cd MiniGridEnv && uv run --with pytest pytest tests`. The OpenEnv contract is validated with `openenv validate`.
What remains:

- **GiGPO trainer.** A `GiGPOTrainer(GRPOTrainer)` subclass. Minimum diff: add `obs_texts` / `step_boundaries` to the rollout return, compute anchor-state groups, and expand step advantages to tokens.
- **Eval parity.** `inference/run_episode.py` reads memory during play but does not yet mirror training's post-episode LLM memory rewrite. Evaluation should match training end-to-end; add a post-episode-memory-rewrite eval variant when more compute is available.
- **Baselines.** BabyAI's BotAgent and zero-shot LLM baselines, run over enough seeds to report completion rates and calibration against GRPO / GRPO+memory (deferred for lack of compute).

MiniGridEnv + MiniGridPT takes the gym-native MiniGrid/BabyAI curriculum and turns it into a complete OpenEnv + GRPO + memory pipeline. The environment is a faithful wrap: text observation, NL action, BabyAI's ten stages. The training package is the extension: branch-stable markdown memory, a post-episode LLM rewrite shaped by `_temporary_vllm_max_tokens`, and an env-mask-aware rollout loop that makes variable-length multi-turn episodes play nicely with vLLM server mode.
The infrastructure contributions (NCCL generate-count padding for variable-length rollouts, branch-stable per-chain memory files, the `max_completion_length` context manager for mixed action/memory generation budgets, per-reset curriculum via `reset()` kwargs) are lessons the next OpenEnv + TRL 1.0 + multi-turn + memory submission will need.
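The mixed-budget pattern mentioned above can be sketched as a small context manager. This is an illustrative reconstruction under assumed names (the attribute `max_completion_length` matches TRL's trainer config; the function name and trainer argument are hypothetical), not the shipped implementation:

```python
from contextlib import contextmanager

@contextmanager
def temporary_max_tokens(trainer, n_tokens):
    """Temporarily raise the completion budget for the long post-episode
    memory rewrite, then restore the short per-action budget, even if the
    generation call raises."""
    old = trainer.max_completion_length
    trainer.max_completion_length = n_tokens
    try:
        yield trainer
    finally:
        trainer.max_completion_length = old
```

The `try/finally` is the load-bearing part: a failed memory-rewrite generation must not leave the trainer stuck with the large budget for subsequent action steps.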
Empirical completion tables and memory ablations await the next compute cycle (Section 7 for the open question; Section 11 for the planned experiment matrix). What ships with this post is the validated pipeline and the formal semantics for $M$.