Document step() action-scaling contract; add _validate_action_shares hook #1416
Open
guitaristsam wants to merge 1 commit into AI4Finance-Foundation:master from
Conversation
`StockTradingEnv.step()` rescales the input `actions` from [-1, 1] to
integer share counts via `actions = actions * self.hmax; actions.astype(int)`
before any `_buy_stock` / `_sell_stock` call sees it. This is correct, but
subclasses overriding `step()` to add custom action constraints commonly
make one of two mistakes:
1. Validate / clip / round the raw [-1, 1] action BEFORE the parent's
`* self.hmax` rescaling. Net effect: the action grid collapses to
{-hmax, 0, +hmax} with no fine-grained levels in between, and the
policy `std` gets stuck during PPO training because there's no
gradient benefit to outputting fractional values.
2. Re-implement the entire `step()` flow to inject custom logic, which
means the subclass has to track FinRL's internal state-layout
conventions (`state[1:1+stock_dim]` is prices, not shares — a separate
trap that this PR does not address but is mentioned as related).
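Mistake 1 is easy to reproduce in isolation. A minimal sketch (the `hmax` value and the sample actions here are illustrative; in FinRL, `hmax` is a constructor argument, commonly 100 in the tutorials):

```python
import numpy as np

hmax = 100  # illustrative; hmax is a constructor argument in FinRL

raw = np.array([-0.75, -0.05, 0.0, 0.4, 1.0])  # policy output in [-1, 1]

# Correct order: scale first, then discretize -> fine-grained share grid
correct = (raw * hmax).astype(int)              # [-75, -5, 0, 40, 100]

# Mistake 1: rounding the raw action BEFORE the * hmax rescaling
# collapses the whole grid to {-hmax, 0, +hmax}
collapsed = (np.round(raw) * hmax).astype(int)  # [-100, 0, 0, 0, 100]
```

Every raw action that rounds to zero maps to zero shares, so the policy gets no signal for outputting fine-grained fractional values.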
This PR adds two non-breaking changes:
- A docstring on `step()` explicitly stating the [-1, 1] -> integer-shares
rescaling contract and pointing subclassers at the new hook.
- A `_validate_action_shares(action_shares)` method that runs after the
scaling and turbulence override but before the buy/sell loop. Default
implementation is identity, so existing subclasses are unaffected.
Subclasses that want post-rescaling constraints (per-step share-delta
caps, integer-only enforcement, tighter budget pre-checks, no-shorting)
can now override this single method instead of re-implementing `step()`.
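To make the contract concrete, here is a minimal sketch of the pattern. The hook name and its call position (after rescaling and the turbulence override, before the buy/sell loop) are from this PR; the env internals are heavily simplified, and the `CappedEnv` subclass is purely illustrative:

```python
import numpy as np

class StockTradingEnv:  # simplified stand-in for FinRL's env
    hmax = 100  # max shares traded per step, per stock (illustrative)

    def step(self, actions):
        """Rescale actions from [-1, 1] to integer share counts
        (actions * hmax) before any buy/sell logic sees them."""
        action_shares = (np.asarray(actions) * self.hmax).astype(int)
        # (real env: turbulence override happens here)
        action_shares = self._validate_action_shares(action_shares)
        # (real env: buy/sell loop, state update, reward follow)
        return action_shares  # real step() returns (state, reward, ...)

    def _validate_action_shares(self, action_shares):
        # Default is the identity, so existing subclasses are unaffected.
        return action_shares

class CappedEnv(StockTradingEnv):
    """Illustrative subclass: a per-step share-delta cap of 10 shares,
    applied post-rescaling via the single hook, with no step() rewrite."""
    def _validate_action_shares(self, action_shares):
        return np.clip(action_shares, -10, 10)
```

With this in place, `CappedEnv().step([0.5, -1.0])` yields share deltas `[10, -10]` instead of `[50, -100]`, while base-class behaviour is unchanged.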
Note on FinRL-X: I'm aware the project is moving toward FinRL-X, which
uses a weight-centric interface and has no Gym env at all. This PR
targets legacy FinRL specifically because the action-scaling trap is an
issue for users following the existing educational/research tutorials,
which explicitly remain on this repo.
Real-world reproducer: I hit this trap while subclassing StockTradingEnv
to enforce integer-share trading on Indian equities. v6 of my project
silently collapsed the policy's action grid; the smoking gun was the
PPO `std` parameter stuck at 0.97 across all 200k training timesteps.
Tracking it down required reading line 314 of env_stocktrading.py and
the offending subclass side by side. Full debugging diary, including 13
versions of single-variable iteration, at
https://github.com/guitaristsam/RL_PPO_NIFTY50 (see the "Action scaling"
section of CLAUDE.md).
Author
The failing pre-commit.ci check is not caused by this PR. I deliberately
did not include the reformat of unrelated files, to keep this PR focused
on the action-scaling documentation. The patch itself (43 lines added, 0
removed) passes black, reorder-python-imports, and pyupgrade locally on
the file alone.