Feature/refactoring #39
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add key_cols parameter to deduplicate_df for label-based comparison
- Split find_target_table into two variants with usage mapping
- Clarify (A) scraper_refactoring.py move is part of implementation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…py recursion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…over.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replaced stub README with full Korean documentation
- Added directory structure, installation, config explanation
- Added 3-step usage guide (scrape → parse → merge)
- Added pipeline table, file naming format, report type codes
- Added scraped data download links (1999–2020)
- Removed changeLog.md, archive/index.md, docs/superpowers/ plans and specs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Reviewed by Cursor Bugbot for commit 0d3d4fd. Configure here.
        aggfunc="first")

    os.makedirs(OUTPUT_DIR, exist_ok=True)
    df_time.to_csv(os.path.join(OUTPUT_DIR, "wp001_data_007_output.csv"), sep="\t")
Parser output filenames don't match merge script expectations
High Severity
Several parsers write output files using an old naming convention (wp001_data_NNN_output.csv) that doesn't match the names the merge scripts expect (wp01.dataNN.output.csv). For example, d4_1_time_information.py saves to wp001_data_007_output.csv, but all four merge scripts (e1, e3, e4, e5) try to read wp01.data06.output.csv. Similarly, d2_1 writes wp001_data_002_output.csv instead of wp01.data02.output.csv, d3_2 writes wp001_data_009_output.csv, and d5_1 writes wp001_data_008_output.csv. This breaks the entire pipeline, because the merge steps fail with FileNotFoundError.
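One way to keep parsers and merge scripts from drifting apart is to build the filename in a single place. The sketch below is only an illustration: the helper name output_filename and the zero-padded two-digit pattern are assumptions inferred from the names quoted above (wp01.data06.output.csv, wp01.data02.output.csv), not code from this PR.

```python
import os


def output_filename(dataset_no: int) -> str:
    """Build the output name in the form the merge scripts read.

    Assumption: the dataset number is zero-padded to two digits,
    as in the examples 'wp01.data06.output.csv' quoted in the review.
    """
    if dataset_no < 0:
        raise ValueError("dataset_no must be non-negative")
    return f"wp01.data{dataset_no:02d}.output.csv"


def output_path(output_dir: str, dataset_no: int) -> str:
    """Join the output directory with the shared filename."""
    return os.path.join(output_dir, output_filename(dataset_no))
```

If both the parser that writes a dataset and the merge script that reads it call the same helper, a naming change only has to be made once.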
    result_string = ''

    if ''.join(table.text.split()).find("감사위원회") > 0:
        result_string = "감사위원회"
Using find() > 0 misses matches at position zero
Medium Severity
find("감사위원회") > 0 fails to detect a match when the search term ("감사위원회", audit committee) appears at the very start of the text (position 0), since 0 > 0 is False. str.find returns -1 when there is no match, so the correct check is != -1 (or the in operator). The same pattern appears in d5_1_total_asset.py and d5_3_inv_rec.py for the "단위" (unit), ":원" (won), "부채" (liabilities), and "자산" (assets) lookups, potentially causing incorrect unit detection or missed balance sheet tables.
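The difference is easy to reproduce in isolation. The following is a minimal sketch, not the parser's actual code: the function names are hypothetical, and the whitespace stripping mirrors the ''.join(text.split()) idiom from the snippet above.

```python
def contains_buggy(text: str, term: str) -> bool:
    """Mirrors the reviewed check: misses a match at index 0."""
    return "".join(text.split()).find(term) > 0


def contains_fixed(text: str, term: str) -> bool:
    """Uses the `in` operator, which is True for a match at any index."""
    return term in "".join(text.split())


# The term sits at the very start of the text, so find() returns 0.
sample = "감사위원회 설치 여부"
print(contains_buggy(sample, "감사위원회"))  # False
print(contains_fixed(sample, "감사위원회"))  # True
```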
    end_date: 20220531
    report_type: "A001"
    delay_seconds: 300
    max_results_per_page: 15
Personal filesystem paths committed in config.yaml
Medium Severity
config.yaml contains hardcoded personal paths like C:/Users/ckpys/Desktop/output and E:/workingDirectory. This file appears to be a developer's local configuration that was committed to the repository. It won't work for any other contributor and exposes a username. This file would typically be .gitignored with a config.yaml.example template provided instead.
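If the template approach is adopted, the entrypoint needs to decide which file to read. The sketch below shows one possible lookup order, assuming a git-ignored config.yaml next to a committed config.yaml.example; the function name resolve_config_path and the fallback behavior are illustrative assumptions, not part of this PR.

```python
from pathlib import Path


def resolve_config_path(base_dir: str) -> Path:
    """Prefer the developer's local config.yaml, else the committed template.

    Assumption: config.yaml is git-ignored and config.yaml.example is a
    template that contains only placeholder paths.
    """
    base = Path(base_dir)
    local = base / "config.yaml"
    template = base / "config.yaml.example"
    if local.exists():
        return local
    if template.exists():
        return template
    raise FileNotFoundError(
        f"Neither {local.name} nor {template.name} found in {base}"
    )
```

Failing loudly when neither file exists may be preferable to silently using built-in defaults, since the output and working directories differ per machine.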
    # Remove rows with total audit hours under 100

    df = df[df["감사_합계"] >= 100]
Duplicated data loading logic across all merge scripts
Low Severity
The identical block of code for loading wp01.data06.output.csv, renaming columns, filtering by December year-end, removing financial industry rows via industry.xlsx, and filtering by audit hours ≥ 100 is copy-pasted verbatim across e1_merge_period.py, e3_merge_reportdate.py, e4_merge_financials.py, and e5_audit_committee.py. Extracting this into a shared utility function would reduce maintenance burden and risk of inconsistent changes.
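A shared helper along the following lines could replace the copied block. This is a sketch under stated assumptions: only the column "감사_합계" and the 100-hour threshold come from the snippet above, while the function name and the fiscal_month and industry_code column names are hypothetical placeholders for whatever the merge scripts actually use.

```python
import pandas as pd


def apply_common_filters(
    df: pd.DataFrame,
    financial_codes=(),
    fiscal_month_col: str = "fiscal_month",   # hypothetical column name
    industry_col: str = "industry_code",      # hypothetical column name
    hours_col: str = "감사_합계",              # column used in the reviewed code
    min_hours: int = 100,
) -> pd.DataFrame:
    """Apply the three filters shared by the merge scripts."""
    # Keep December year-end rows only.
    out = df[df[fiscal_month_col] == 12]
    # Drop rows whose industry code is in the financial-industry list.
    out = out[~out[industry_col].isin(list(financial_codes))]
    # Drop rows with fewer than min_hours total audit hours.
    out = out[out[hours_col] >= min_hours]
    return out.reset_index(drop=True)
```

Each merge script would then load its input once and call this function, so a change to a threshold or a column name happens in one place.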


Note
Medium Risk
Medium risk because it introduces a large amount of new parsing/merging code and refactors the scraper entrypoint, which could change output datasets and scraping behavior if file patterns/column assumptions are off.
Overview
Refactors the DART scraping workflow into a more structured, config-driven pipeline:
- app.py now calls scraper/dart_scraper.py, reads config.yaml, and replaces the recursive retry logic with a loop using a configurable delay.
- Adds a shared parsing utility module (parsers/common.py) plus many new parser scripts to extract audit/business-report fields into standardized TSV outputs, along with merge scripts (merge/e*.py) that combine those outputs and apply filters (year-end, industry exclusion, minimum audit hours).
- Introduces project hygiene and tooling: new requirements.txt, pytest setup with unit tests for parsers/common.py, .gitignore, expanded README.md, and removes legacy/test artifacts (changeLog.md, test.py, archive/index.md).
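For readers unfamiliar with the retry change, the general shape of a loop-based retry with a configurable delay is sketched below. It is not the code in app.py: run_with_retry, max_attempts, and the injectable sleep parameter are hypothetical names, and only the delay_seconds setting comes from config.yaml as shown earlier.

```python
import time


def run_with_retry(task, delay_seconds=300, max_attempts=None, sleep=time.sleep):
    """Call task() until it succeeds, waiting delay_seconds between tries.

    A loop avoids the growing call stack of a recursive retry.
    max_attempts=None retries indefinitely (an assumption, not PR behavior).
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return task()
        except Exception:
            # Give up and re-raise once the attempt budget is spent.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay_seconds)
```

Passing sleep as a parameter lets a unit test substitute a recorder for time.sleep, so the retry path can be tested without waiting.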