Too many images in cc_lmdb of Union14M-U contain the word "alamy"

Observation:
I randomly checked about 1000 images from cc_lmdb, and about 20% of them contain the word "alamy".

![Image](https://github.com/user-attachments/assets/0ea5b6dd-27ff-4a41-bc9c-f464b6157db7)
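
To gauge how far the true share in cc_lmdb could be from the sampled 20%, here is a minimal sketch that puts a 95% Wilson score interval around the estimate. It assumes the check was a simple random sample, and it takes the rounded figures above (200 hits out of 1000) as exact counts, which they are not. The function name `wilson_interval` is just illustrative.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives ~95% coverage)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: "about 20%" of "about 1000" images, rounded to 200 / 1000.
low, high = wilson_interval(200, 1000)
print(f"estimated share of 'alamy' images: {low:.1%} to {high:.1%}")
```

Under those assumptions the interval comes out at roughly 18% to 23%, so the high share is unlikely to be a sampling fluke.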

Questions/Concerns:
1. Is this expected behavior for cc_lmdb?
2. Could this skew model training?