Commit f648583

feat: add CSV and TSV column input support
1 parent 45ee143 commit f648583

5 files changed

Lines changed: 189 additions & 12 deletions

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -6,6 +6,19 @@ All notable changes to this repository will be documented in this file.
 
 - Ongoing documentation and release polish.
 
+## [v0.5.0] - 2026-03-13
+
+### Added
+
+- `--column COLUMN_NAME` support for reading paper ids / URLs from named columns in `.csv` and `.tsv` input files
+
+### Changed
+
+- plain-text `--input-file` behavior remains line-based and backward compatible
+- structured CSV/TSV inputs now ignore blank rows, comment-only rows, and blank selected cells
+- CSV/TSV files without `--column` now auto-select an input column only when it is unambiguous; otherwise the CLI fails with the available column names
+- README, skill instructions, packaged artifact, and CI smoke coverage now document and verify structured file input handling
+
 ## [v0.4.1] - 2026-03-13
 
 ### Added

README.md

Lines changed: 35 additions & 1 deletion
@@ -122,13 +122,46 @@ EOF
 python3 scripts/alphaxiv_lookup.py --input-file papers.txt --format brief
 ```
 
+### Batch lookup from CSV with an explicit column
+
+```bash
+cat > papers.csv <<'EOF'
+paper_id,title
+2603.07612,Example Paper
+# comment row,
+,Missing id
+https://arxiv.org/abs/2401.12345,Another Paper
+EOF
+python3 scripts/alphaxiv_lookup.py --input-file papers.csv --column paper_id --format brief
+```
+
+### Batch lookup from TSV with an obvious default column
+
+```bash
+cat > papers.tsv <<'EOF'
+paper_id	title
+2603.07612	Example Paper
+2401.12345	Another Paper
+EOF
+python3 scripts/alphaxiv_lookup.py --input-file papers.tsv --format json-compact
+```
+
 ### Combine `--input-file` with direct arguments
 
 ```bash
 python3 scripts/alphaxiv_lookup.py --input-file papers.txt 'https://www.alphaxiv.org/overview/2501.01234' --format json-compact
 ```
 
-`--input-file PATH` reads one paper id or URL per line, ignores blank lines, ignores lines starting with `#`, and participates in the same single-item vs batch rendering rules as direct positional arguments.
+`--input-file PATH` keeps `.txt` and other non-structured files line-based: one paper id or URL per line, with blank lines and lines starting with `#` ignored.
+
+For `.csv` and `.tsv` files, the CLI reads a header row and then pulls values from a named column:
+
+- use `--column COLUMN_NAME` to select the input column explicitly
+- blank rows, comment-only rows, and rows where the selected column is blank are ignored
+- if `--column` is omitted, the CLI only auto-selects a column when it is unambiguous (for example the file has exactly one column, or exactly one clearly named input column such as `paper_id` or `url`)
+- otherwise it fails clearly and prints the available column names
+
+Structured-file inputs participate in the same single-item vs batch rendering rules as direct positional arguments, and `--input-file` can still be combined with direct ids / URLs in the same command.
 
 ## Output fields
 
@@ -194,6 +227,7 @@ Structure:
 - `--format brief` / `--format brief-zh` prefer the best retrieved summary, but can still produce a useful user-facing brief from the arXiv abstract alone
 - Batch mode accepts multiple ids / URLs in one run and keeps single-item behavior backward compatible
 - `--input-file PATH` can be used more than once and can be combined with direct ids / URLs in the same command
+- `.csv` and `.tsv` inputs support header-based extraction through `--column COLUMN_NAME`, while plain text files keep the existing line-by-line behavior unchanged
 - AlphaXiv is treated as a shortcut, not a replacement for reading the full paper when exact details matter
 
 ## License
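
The auto-selection rule described in the README can be sketched with two helpers that this commit adds to `scripts/alphaxiv_lookup.py`. The snippet below reproduces `canonicalize_column_name` and `obvious_input_column_index` from the diff (lightly condensed) and runs them on a few header rows. The header rows are made up for illustration and are not taken from a real input file.

```python
import re
from typing import List, Optional

# Same names as in the diff: headers are compared after normalization.
OBVIOUS_INPUT_COLUMN_NAMES = {
    "paper", "paperid", "paperurl",
    "arxiv", "arxivid", "arxivurl",
    "url", "link",
}


def canonicalize_column_name(name: str) -> str:
    # Lowercase and drop whitespace, underscores, and hyphens.
    return re.sub(r"[\s_-]+", "", name.strip().lower())


def obvious_input_column_index(columns: List[str]) -> Optional[int]:
    indexed = [(idx, column) for idx, column in enumerate(columns) if column]
    if len(indexed) == 1:
        # A single named column is always unambiguous.
        return indexed[0][0]
    matches = [
        idx for idx, column in indexed
        if canonicalize_column_name(column) in OBVIOUS_INPUT_COLUMN_NAMES
    ]
    # Exactly one well-known name wins; zero or several means "ambiguous".
    return matches[0] if len(matches) == 1 else None


# Illustrative header rows (hypothetical):
print(canonicalize_column_name(" Paper-ID "))             # paperid
print(obvious_input_column_index(["paper_id", "title"]))  # 0
print(obvious_input_column_index(["notes"]))              # 0
print(obvious_input_column_index(["url", "link"]))        # None
print(obvious_input_column_index(["title", "year"]))      # None
```

When the result is `None`, the script raises `InputFileError` and lists the available columns, so passing `--column` explicitly is the safer choice for files with several id-like columns.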

SKILL.md

Lines changed: 5 additions & 1 deletion
@@ -17,8 +17,12 @@ Prefer alphaXiv first because it often exposes an AI-generated overview that is
    - Accept alphaXiv URLs like `https://www.alphaxiv.org/overview/2401.12345`
 2. Run the bundled script:
    - The script accepts one or more paper ids / URLs in a single invocation.
-   - Use `--input-file PATH` to read one id / URL per line; ignore blank lines and lines starting with `#`.
+   - Use `--input-file PATH` to add repo-local batch inputs.
+   - Plain-text inputs stay line-based: read one id / URL per line, ignoring blank lines and lines starting with `#`.
+   - CSV/TSV inputs use a header row. Prefer `--column COLUMN_NAME` to select the input column explicitly.
+   - If `--column` is omitted for CSV/TSV, the script only auto-selects an obvious single input column; otherwise it fails and prints the available columns.
    - `python3 scripts/alphaxiv_lookup.py "<paper-or-url>" --format markdown`
+   - `python3 scripts/alphaxiv_lookup.py --input-file papers.csv --column paper_id --format json`
    - Use `--format json` for full structured output.
    - Use `--format json-compact` when you want a smaller machine-friendly payload.
    - Use `--format text` for a clean plain-text brief.
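
As a rough illustration of the CSV/TSV rules above, the sketch below parses the README's sample CSV from an in-memory string with the standard `csv` module. `extract_column` is a hypothetical, simplified stand-in for the script's `read_structured_input_file`: it takes the column index directly, assumes the header is the first row, and leaves out header resolution and error handling.

```python
import csv
import io
from typing import List

# The sample file from the README, held in memory instead of on disk.
SAMPLE_CSV = """paper_id,title
2603.07612,Example Paper
# comment row,
,Missing id
https://arxiv.org/abs/2401.12345,Another Paper
"""


def extract_column(text: str, column_index: int, delimiter: str = ",") -> List[str]:
    """Hypothetical helper: return the usable values of one column."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    next(reader, None)  # assumption: the first row is the header
    values: List[str] = []
    for row in reader:
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue  # blank row
        if len(cells) == 1 and cells[0].startswith("#"):
            continue  # comment-only row
        value = row[column_index].strip() if column_index < len(row) else ""
        if not value or value.startswith("#"):
            continue  # blank or commented-out selected cell
        values.append(value)
    return values


print(extract_column(SAMPLE_CSV, 0))
# ['2603.07612', 'https://arxiv.org/abs/2401.12345']

# The same helper handles TSV when given a tab delimiter.
print(extract_column("paper_id\ttitle\n2603.07612\tExample Paper\n", 0, "\t"))
# ['2603.07612']
```

The comment row and the row with a missing id are both dropped, which matches the behavior the changelog describes for structured inputs.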

dist/alphaxiv-paper-lookup.skill

1.11 KB
Binary file not shown.

scripts/alphaxiv_lookup.py

Lines changed: 136 additions & 10 deletions
@@ -1,5 +1,6 @@
 #!/usr/bin/env python3
 import argparse
+import csv
 import html
 import json
 import re
@@ -87,6 +88,22 @@
 )
 
 
+OBVIOUS_INPUT_COLUMN_NAMES = {
+    "paper",
+    "paperid",
+    "paperurl",
+    "arxiv",
+    "arxivid",
+    "arxivurl",
+    "url",
+    "link",
+}
+
+
+class InputFileError(ValueError):
+    pass
+
+
 def fetch(url: str, timeout: int = 25) -> str:
     req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
     with urllib.request.urlopen(req, timeout=timeout) as resp:
@@ -758,7 +775,110 @@ def render_many(results: List[Dict[str, object]], output_format: str) -> str:
     return ("\n\n" + ("=" * 80) + "\n\n").join(blocks) + "\n"
 
 
-def read_input_file(path: str) -> List[str]:
+def canonicalize_column_name(name: str) -> str:
+    return re.sub(r"[\s_-]+", "", name.strip().lower())
+
+
+def nonempty_row_values(row: List[str]) -> List[str]:
+    return [cell.strip() for cell in row if cell and cell.strip()]
+
+
+def is_blank_row(row: List[str]) -> bool:
+    return not nonempty_row_values(row)
+
+
+def is_comment_only_row(row: List[str]) -> bool:
+    values = nonempty_row_values(row)
+    return len(values) == 1 and values[0].startswith("#")
+
+
+def visible_column_names(columns: List[str]) -> List[str]:
+    return [column for column in columns if column]
+
+
+def obvious_input_column_index(columns: List[str]) -> Optional[int]:
+    indexed = [(idx, column) for idx, column in enumerate(columns) if column]
+    if len(indexed) == 1:
+        return indexed[0][0]
+
+    matches = [
+        (idx, column)
+        for idx, column in indexed
+        if canonicalize_column_name(column) in OBVIOUS_INPUT_COLUMN_NAMES
+    ]
+    if len(matches) == 1:
+        return matches[0][0]
+    return None
+
+
+def resolve_structured_input_column(path: str, columns: List[str], column_name: Optional[str]) -> int:
+    indexed = [(idx, column) for idx, column in enumerate(columns) if column]
+    if not indexed:
+        raise InputFileError(f"structured input file '{path}' has an empty header row")
+
+    normalized_to_indexes: Dict[str, List[int]] = {}
+    for idx, column in indexed:
+        normalized_to_indexes.setdefault(canonicalize_column_name(column), []).append(idx)
+
+    available = ", ".join(visible_column_names(columns))
+
+    if column_name:
+        requested = canonicalize_column_name(column_name)
+        matches = normalized_to_indexes.get(requested, [])
+        if not matches:
+            raise InputFileError(
+                f"structured input file '{path}' does not contain column '{column_name}'; available columns: {available}"
+            )
+        if len(matches) > 1:
+            raise InputFileError(
+                f"structured input file '{path}' has multiple columns matching '{column_name}'; available columns: {available}"
+            )
+        return matches[0]
+
+    obvious_index = obvious_input_column_index(columns)
+    if obvious_index is not None:
+        return obvious_index
+
+    raise InputFileError(
+        f"structured input file '{path}' requires --column COLUMN_NAME; available columns: {available}"
+    )
+
+
+def read_structured_input_file(path: str, delimiter: str, column_name: Optional[str]) -> List[str]:
+    papers: List[str] = []
+    with open(path, "r", encoding="utf-8-sig", newline="") as handle:
+        reader = csv.reader(handle, delimiter=delimiter)
+
+        columns: Optional[List[str]] = None
+        for row in reader:
+            if is_blank_row(row) or is_comment_only_row(row):
+                continue
+            columns = [cell.strip() for cell in row]
+            break
+
+        if columns is None:
+            return papers
+
+        column_index = resolve_structured_input_column(path, columns, column_name)
+
+        for row in reader:
+            if is_blank_row(row) or is_comment_only_row(row):
+                continue
+            value = row[column_index].strip() if column_index < len(row) else ""
+            if not value or value.startswith("#"):
+                continue
+            papers.append(value)
+
+    return papers
+
+
+def read_input_file(path: str, column_name: Optional[str] = None) -> List[str]:
+    lowered_path = path.lower()
+    if lowered_path.endswith(".csv"):
+        return read_structured_input_file(path, ",", column_name)
+    if lowered_path.endswith(".tsv"):
+        return read_structured_input_file(path, "\t", column_name)
+
     papers: List[str] = []
     with open(path, "r", encoding="utf-8") as handle:
         for raw_line in handle:
@@ -769,7 +889,7 @@ def read_input_file(path: str) -> List[str]:
     return papers
 
 
-def expand_cli_inputs(argv: List[str]) -> List[str]:
+def expand_cli_inputs(argv: List[str], input_column: Optional[str] = None) -> List[str]:
     papers: List[str] = []
     index = 0
 
@@ -784,20 +904,20 @@ def expand_cli_inputs(argv: List[str]) -> List[str]:
             index += 1
             if index >= len(argv):
                 break
-            papers.extend(read_input_file(argv[index]))
+            papers.extend(read_input_file(argv[index], input_column))
             index += 1
             continue
 
         if token.startswith("--input-file="):
-            papers.extend(read_input_file(token.split("=", 1)[1]))
+            papers.extend(read_input_file(token.split("=", 1)[1], input_column))
             index += 1
             continue
 
-        if token in {"--format", "--timeout"}:
+        if token in {"--column", "--format", "--timeout"}:
             index += 2
             continue
 
-        if token.startswith("--format=") or token.startswith("--timeout="):
+        if token.startswith("--column=") or token.startswith("--format=") or token.startswith("--timeout="):
             index += 1
             continue
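
The manual scan in `expand_cli_inputs` has to step over option values so that the argument of `--column`, `--format`, or `--timeout` is not mistaken for a paper id. Below is a simplified sketch of that skipping logic. `positional_tokens` and `VALUE_OPTIONS` are hypothetical names, and unlike the real function this helper only collects positional tokens and does not read any input files.

```python
from typing import List

# Options that consume a value, either as the next token or after "=".
VALUE_OPTIONS = {"--column", "--format", "--timeout", "--input-file"}


def positional_tokens(argv: List[str]) -> List[str]:
    """Hypothetical helper: keep only tokens that are not options or option values."""
    tokens: List[str] = []
    index = 0
    while index < len(argv):
        token = argv[index]
        if token in VALUE_OPTIONS:
            index += 2  # skip the option and its separate value
            continue
        if any(token.startswith(option + "=") for option in VALUE_OPTIONS):
            index += 1  # "--option=value" is a single token
            continue
        tokens.append(token)
        index += 1
    return tokens


argv = ["--input-file", "papers.csv", "--column", "paper_id", "2401.12345", "--format=brief"]
print(positional_tokens(argv))  # ['2401.12345']
```

Without `--column` in the skip set, the value `paper_id` would fall through and be treated as a paper id, which is why this commit adds it to both the two-token and `--option=value` branches.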

@@ -921,17 +1041,23 @@ def main(argv: Optional[List[str]] = None) -> int:
         action="append",
         default=[],
         metavar="PATH",
-        help="Read one paper id or URL per line from PATH. Blank lines and lines starting with # are ignored.",
+        help="Read paper ids or URLs from PATH. Text files stay line-based; CSV/TSV files support header-based column selection.",
+    )
+    parser.add_argument(
+        "--column",
+        help="For CSV/TSV --input-file values, read paper ids or URLs from COLUMN_NAME. If omitted, an obvious structured column is used only when it can be chosen unambiguously.",
     )
     parser.add_argument("--format", choices=["json", "json-compact", "markdown", "text", "brief", "brief-zh"], default="json")
     parser.add_argument("--timeout", type=int, default=25, help="HTTP timeout in seconds (default: 25)")
     args = parser.parse_args(argv)
 
     try:
-        papers = expand_cli_inputs(argv)
-    except OSError as err:
-        path = err.filename or "<unknown>"
-        parser.error(f"unable to read input file '{path}': {err.strerror or err}")
+        papers = expand_cli_inputs(argv, input_column=args.column)
+    except (InputFileError, OSError) as err:
+        if isinstance(err, OSError):  # only OSError carries filename / strerror
+            path = err.filename or "<unknown>"
+            parser.error(f"unable to read input file '{path}': {err.strerror or err}")
+        parser.error(str(err))
 
     if not papers:
         parser.error("provide at least one paper id / URL or --input-file PATH")
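
`InputFileError` derives from `ValueError`, so unlike `OSError` it has no `filename` or `strerror` attributes, and the two error types need to be formatted separately. Here is a minimal sketch of that split. `describe_input_error` is a hypothetical helper, not part of the script, and the error instances are constructed by hand purely for illustration.

```python
class InputFileError(ValueError):
    pass


def describe_input_error(err: Exception) -> str:
    """Hypothetical helper: build the message that would be passed to parser.error()."""
    if isinstance(err, OSError):
        # filename and strerror exist only on OSError and its subclasses.
        path = err.filename or "<unknown>"
        return f"unable to read input file '{path}': {err.strerror or err}"
    # InputFileError already carries a complete, user-facing message.
    return str(err)


missing = FileNotFoundError(2, "No such file or directory", "missing.csv")
print(describe_input_error(missing))
# unable to read input file 'missing.csv': No such file or directory

ambiguous = InputFileError(
    "structured input file 'papers.csv' requires --column COLUMN_NAME; "
    "available columns: id, title"
)
print(describe_input_error(ambiguous))

# A plain ValueError subclass has no filename attribute at all.
print(hasattr(ambiguous, "filename"))  # False
```

Reading `err.filename` before checking the exception type would raise `AttributeError` for `InputFileError`, hiding the column-name hint that the structured-input errors are meant to show.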
