Feature/refactoring #39

Open
ypspy wants to merge 23 commits into master from feature/refactoring

@ypspy ypspy commented Apr 3, 2026

Owner

Note

Medium Risk
Medium risk because it introduces a large amount of new parsing/merging code and refactors the scraper entrypoint, which could change output datasets and scraping behavior if file patterns/column assumptions are off.

Overview
Refactors the DART scraping workflow into a more structured, config-driven pipeline: app.py now calls scraper/dart_scraper.py, reads config.yaml, and replaces the recursive retry logic with a loop using a configurable delay.
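As an illustration of the retry change described above, here is a minimal sketch of a loop-based retry with a configurable delay. It is not the code from scraper/dart_scraper.py: the function name fetch_with_retry, the max_attempts cap, and the injectable sleep argument are assumptions made for the example. Only delay_seconds corresponds to a key that appears in config.yaml.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    delay_seconds: float,
    max_attempts: int = 5,  # hypothetical cap, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, pausing delay_seconds after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Unlike a recursive retry, the loop keeps the call stack flat however many attempts are needed, and the delay can be read from configuration rather than hardcoded.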

Adds a shared parsing utility module (parsers/common.py) plus many new parser scripts to extract audit/business-report fields into standardized TSV outputs, along with merge scripts (merge/e*.py) that combine those outputs and apply filters (year-end, industry exclusion, minimum audit hours).

Introduces project hygiene and tooling: new requirements.txt, pytest setup with unit tests for parsers/common.py, .gitignore, expanded README.md, and removes legacy/test artifacts (changeLog.md, test.py, archive/index.md).

Reviewed by Cursor Bugbot for commit 0d3d4fd. Bugbot is set up for automated code reviews on this repo.

ypspy and others added 23 commits April 1, 2026 15:55
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add key_cols parameter to deduplicate_df for label-based comparison
- Split find_target_table into two variants with usage mapping
- Clarify (A) scraper_refactoring.py move is part of implementation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…py recursion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…over.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replaced stub README with full Korean documentation
- Added directory structure, installation, config explanation
- Added 3-step usage guide (scrape → parse → merge)
- Added pipeline table, file naming format, report type codes
- Added scraped data download links (1999–2020)
- Removed changeLog.md, archive/index.md, docs/superpowers/ plans and specs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

aggfunc="first")

os.makedirs(OUTPUT_DIR, exist_ok=True)
df_time.to_csv(os.path.join(OUTPUT_DIR, "wp001_data_007_output.csv"), sep="\t")

Parser output filenames don't match merge script expectations

High Severity

Several parsers write output files using an old naming convention (wp001_data_NNN_output.csv) that doesn't match the names the merge scripts expect (wp01.dataNN.output.csv). For example, d4_1_time_information.py saves to wp001_data_007_output.csv, but all four merge scripts (e1, e3, e4, e5) try to read wp01.data06.output.csv. Similarly, d2_1 writes wp001_data_002_output.csv instead of wp01.data02.output.csv, d3_2 writes wp001_data_009_output.csv, and d5_1 writes wp001_data_008_output.csv. This breaks the entire pipeline, since the merge steps will fail with FileNotFoundError.
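One way to keep parsers and merge scripts from drifting apart is to build the filename in a single shared place. The sketch below assumes a hypothetical helper output_path (for example in parsers/common.py) and the name OUTPUT_NAME_TEMPLATE. Neither is from the PR, and which index each parser should use is left to the author.

```python
import os

# Assumed single source of truth for the naming convention the merge scripts expect.
OUTPUT_NAME_TEMPLATE = "wp01.data{index:02d}.output.csv"


def output_path(output_dir: str, index: int) -> str:
    """Return the path of dataset number `index` under the shared naming scheme."""
    return os.path.join(output_dir, OUTPUT_NAME_TEMPLATE.format(index=index))


# A parser would write to, and a merge script would read from, the same call:
# df_time.to_csv(output_path(OUTPUT_DIR, 6), sep="\t")
```

With both sides calling the same helper, a change to the convention is made once, and a mismatch can no longer surface as a FileNotFoundError at merge time.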

result_string = ''

if ''.join(table.text.split()).find("감사위원회") > 0:
    result_string = "감사위원회"

Using find() > 0 misses matches at position zero

Medium Severity

find("감사위원회") > 0 fails to detect a match when the search term appears at the very start of the text: str.find returns 0 in that case, and 0 > 0 is False. The correct check is find(...) != -1 or, more idiomatically, the in operator. The same pattern appears in d5_1_total_asset.py and d5_3_inv_rec.py for "단위", ":원", "부채", and "자산" lookups, potentially causing incorrect unit detection or missed balance sheet tables.
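The difference can be reproduced in isolation. The two helper names below are hypothetical and exist only to contrast the checks; the sample strings are made up for the demonstration.

```python
def contains_buggy(text: str, term: str) -> bool:
    # Mirrors the flagged pattern: a match at index 0 is treated as "not found".
    return text.find(term) > 0


def contains(text: str, term: str) -> bool:
    # Membership test: true for a match at any position, including index 0.
    return term in text


at_start = "감사위원회설치"    # term begins at index 0
in_middle = "당사는감사위원회"  # term begins at index 3

print(contains_buggy(at_start, "감사위원회"), contains(at_start, "감사위원회"))
print(contains_buggy(in_middle, "감사위원회"), contains(in_middle, "감사위원회"))
```

The two checks agree whenever the term is absent or appears after the first character, which is why the bug can go unnoticed until a table happens to start with the search term.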

Comment thread config.yaml
end_date: 20220531
report_type: "A001"
delay_seconds: 300
max_results_per_page: 15

Personal filesystem paths committed in config.yaml

Medium Severity

config.yaml contains hardcoded personal paths like C:/Users/ckpys/Desktop/output and E:/workingDirectory. This file appears to be a developer's local configuration that was committed to the repository. It won't work for any other contributor and exposes a username. Such a file is typically listed in .gitignore, with a config.yaml.example template committed instead.
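A committed template can stay portable if its path values are relative or reference environment variables, resolved at load time. The sketch below is one possible approach, not the project's loader: resolve_config_path and the DART_OUTPUT_ROOT variable are hypothetical names chosen for the example.

```python
import os
from pathlib import Path


def resolve_config_path(raw: str, project_root: Path) -> Path:
    """Expand ~ and environment variables; anchor relative paths at project_root."""
    expanded = os.path.expandvars(os.path.expanduser(raw))
    path = Path(expanded)
    return path if path.is_absolute() else project_root / path


# A template value such as "output" resolves inside the checkout, while
# "$DART_OUTPUT_ROOT/output" lets each contributor point at their own disk.
```

Each contributor then keeps machine-specific locations in their environment or in an untracked config.yaml, and no username or drive letter has to be committed.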

Comment thread merge/e1_merge_period.py

# Remove rows with fewer than 100 total audit hours

df = df[df["감사_합계"] >= 100]

Duplicated data loading logic across all merge scripts

Low Severity

The identical block of code for loading wp01.data06.output.csv, renaming columns, filtering by December year-end, removing financial industry rows via industry.xlsx, and filtering by audit hours ≥ 100 is copy-pasted verbatim across e1_merge_period.py, e3_merge_reportdate.py, e4_merge_financials.py, and e5_audit_committee.py. Extracting this into a shared utility function would reduce maintenance burden and risk of inconsistent changes.
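A shared helper along the following lines could replace the repeated block. This is a sketch under assumptions: the function name apply_common_filters and the column names fiscal_month and industry_code are hypothetical stand-ins, since the renamed columns are not shown in this thread. Only the 감사_합계 column and the 100-hour threshold come from the diff above. File loading and the industry.xlsx lookup are left out so the filter logic stays visible.

```python
import pandas as pd


def apply_common_filters(
    df: pd.DataFrame,
    financial_codes: list,
    min_hours: float = 100,
) -> pd.DataFrame:
    """Keep December year-end, non-financial rows with enough audit hours."""
    mask = (
        (df["fiscal_month"] == 12)                     # hypothetical column name
        & ~df["industry_code"].isin(financial_codes)   # hypothetical column name
        & (df["감사_합계"] >= min_hours)                # column used in the merge scripts
    )
    return df[mask].reset_index(drop=True)
```

Each merge script would then load its inputs and call this one function, so a change to a threshold or an exclusion list is made in a single place.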
