[ai_generated] XPU wrong result for torch.compile adaptive_avg_pool2d flatten-sum fusion #3449

@laifenxiawucha

Description

🐛 Describe the bug

On torch 2.13.0.dev20260422+xpu, torch.compile(..., backend='inductor') on XPU returns a wrong result for adaptive_avg_pool2d(x, 7).flatten(1).sum(dim=-1), while both CPU eager and XPU eager agree.
The repro matches the backend-semantics pattern fixed upstream by the Inductor contiguous-check / exact-stride patch in pytorch/pytorch#180898.

Reproducer:

import torch
import torch.nn.functional as F

print("torch", torch.__version__)
print("git", getattr(torch.version, "git_version", None))
print("xpu_available", hasattr(torch, "xpu") and torch.xpu.is_available())

def fn(x):
    y = F.adaptive_avg_pool2d(x, 7)
    return y.flatten(1).sum(dim=-1)

torch.manual_seed(42)
x_cpu = torch.randn(2, 33, 8, 8, dtype=torch.float64)
x_xpu = x_cpu.to("xpu")

cpu_out = fn(x_cpu)
xpu_eager = fn(x_xpu).cpu()
torch._dynamo.reset()
compiled = torch.compile(fn, backend="inductor")
xpu_compiled = compiled(x_xpu).cpu()

print("cpu_out", cpu_out.tolist())
print("xpu_eager", xpu_eager.tolist())
print("xpu_compiled", xpu_compiled.tolist())
print("eager_max_diff", (cpu_out - xpu_eager).abs().max().item())
print("compiled_max_diff", (cpu_out - xpu_compiled).abs().max().item())

Observed output:

torch 2.13.0.dev20260422+xpu
git 977c5623eb1561d3dec0b8d101466c48d0709142
xpu_available True
cpu_out [6.880979122500011, -22.76314370329084]
xpu_eager [6.8809791225000225, -22.76314370329084]
xpu_compiled [6.880979122500015, -12.001003104772504]
eager_max_diff 1.1546319456101628e-14
compiled_max_diff 10.762140598518336
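For context on what the kernel is expected to compute: adaptive average pooling with a non-divisible output size (8 -> 7 here) uses overlapping windows, since output bin i covers input rows [floor(i*H/out), ceil((i+1)*H/out)). Below is a minimal pure-Python sketch of that reference bin arithmetic on a single 2-D plane; adaptive_avg_pool2d_ref is my own illustrative helper, not PyTorch code.

```python
import math

def adaptive_avg_pool2d_ref(x, out_size):
    """Reference semantics of adaptive average pooling on one 2-D
    plane (list of lists). Bin (i, j) averages input rows
    [floor(i*H/out), ceil((i+1)*H/out)) and the analogous column
    range, so windows overlap when H % out != 0."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(out_size):
        h0, h1 = (i * h) // out_size, math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            w0, w1 = (j * w) // out_size, math.ceil((j + 1) * w / out_size)
            vals = [x[r][c] for r in range(h0, h1) for c in range(w0, w1)]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

# 8x8 plane pooled to 7x7: each window spans 2 rows/cols and
# adjacent windows overlap by one row/col.
plane = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pooled = adaptive_avg_pool2d_ref(plane, 7)
```

With H=8 and out=7, bin 0 covers rows 0..1 and bin 1 covers rows 1..2, so e.g. pooled[0][0] averages the four values 0, 1, 8, 9.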

Additional context:

Versions

Collected with torch/utils/collect_env.py
Collecting environment information...
PyTorch version: 2.13.0.dev20260422+xpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.10.20 | packaged by conda-forge | (main, Mar  5 2026, 16:42:22) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-110-generic-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250302
Intel GPU driver version:
* intel-opencl-icd:	25.18.33578.51-1146~24.04
* libze1:	1.24.0.0-1146~24.04
Intel GPU models onboard:
* Intel(R) Data Center GPU Max 1550
Intel GPU models detected:
* [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', device_id=0xBD5, uuid=8680d50b-2f00-0000-9a00-000000000001, driver_version='1.6.33578+51', total_memory=65520MB, local_mem_size=128KB, max_compute_units=512, memory_clock_rate=3200MHz, memory_bus_width=64-bit, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
* [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', device_id=0xBD5, uuid=8680d50b-2f00-0000-9a00-000000000002, driver_version='1.6.33578+51', total_memory=65520MB, local_mem_size=128KB, max_compute_units=512, memory_clock_rate=3200MHz, memory_bus_width=64-bit, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] dpcpp-cpp-rt==2025.3.2
[pip3] impi-rt==2021.17.2
[pip3] intel-cmplr-lib-rt==2025.3.2
[pip3] intel-cmplr-lib-ur==2025.3.2
[pip3] intel-cmplr-lic-rt==2025.3.2
[pip3] intel-opencl-rt==2025.3.2
[pip3] intel-openmp==2025.3.2
[pip3] intel-pti==0.16.0
[pip3] intel-sycl-rt==2025.3.2
[pip3] mkl==2025.3.1
[pip3] numpy==2.2.6
[pip3] oneccl==2021.17.2
[pip3] oneccl-devel==2021.17.2
[pip3] onemkl-license==2025.3.1
[pip3] onemkl-sycl-blas==2025.3.1
[pip3] onemkl-sycl-dft==2025.3.1
[pip3] onemkl-sycl-lapack==2025.3.1
[pip3] onemkl-sycl-rng==2025.3.1
[pip3] onemkl-sycl-sparse==2025.3.1
[pip3] tbb==2022.3.1
[pip3] tcmlib==1.4.1
[pip3] torch==2.13.0.dev20260422+xpu
[pip3] torchaudio==2.11.0.dev20260422+xpu
[pip3] torchvision==0.27.0.dev20260422+xpu
[pip3] triton-xpu==3.7.1+git21033c4e
[pip3] umf==1.0.3
