Feature/refactoring by ypspy · Pull Request #39 · ypspy/dart-scraping

ypspy · 2026-04-03T22:01:35Z

Note

Medium Risk
Medium risk because it introduces a large amount of new parsing/merging code and refactors the scraper entrypoint, which could change output datasets and scraping behavior if file patterns/column assumptions are off.

Overview
Refactors the DART scraping workflow into a more structured, config-driven pipeline: app.py now calls scraper/dart_scraper.py, reads config.yaml, and replaces the recursive retry logic with a loop using a configurable delay.
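As an illustration of the retry change described above, here is a minimal sketch of a loop-based retry with a configurable delay. The function name `run_with_retry`, the `max_attempts` cap, and the injectable `sleep` argument are assumptions made for this example; they are not taken from app.py.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    delay_seconds: float,
    max_attempts: int = 5,  # hypothetical cap, not from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call task() until it succeeds, waiting delay_seconds between failures.

    A plain loop keeps the call stack flat, unlike a function that
    calls itself again on every failure.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter lets a unit test substitute a recorder for `time.sleep`, so the delay from config.yaml can be checked without actually waiting.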

Adds a shared parsing utility module (parsers/common.py) plus many new parser scripts to extract audit/business-report fields into standardized TSV outputs, along with merge scripts (merge/e*.py) that combine those outputs and apply filters (year-end, industry exclusion, minimum audit hours).

Introduces project hygiene and tooling: new requirements.txt, pytest setup with unit tests for parsers/common.py, .gitignore, expanded README.md, and removes legacy/test artifacts (changeLog.md, test.py, archive/index.md).

Reviewed by Cursor Bugbot for commit 0d3d4fd. Bugbot is set up for automated code reviews on this repo.

- Add key_cols parameter to deduplicate_df for label-based comparison
- Split find_target_table into two variants with usage mapping
- Clarify (A) scraper_refactoring.py move is part of implementation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Replaced stub README with full Korean documentation
- Added directory structure, installation, config explanation
- Added 3-step usage guide (scrape → parse → merge)
- Added pipeline table, file naming format, report type codes
- Added scraped data download links (1999–2020)
- Removed changeLog.md, archive/index.md, docs/superpowers/ plans and specs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

cursor · 2026-04-03T22:07:30Z

+                              aggfunc="first")
+
+os.makedirs(OUTPUT_DIR, exist_ok=True)
+df_time.to_csv(os.path.join(OUTPUT_DIR, "wp001_data_007_output.csv"), sep="\t")


Parser output filenames don't match merge script expectations

High Severity

Several parsers write output files using an old naming convention (wp001_data_NNN_output.csv) that doesn't match the names the merge scripts expect (wp01.dataNNN.output.csv). For example, d4_1_time_information.py saves to wp001_data_007_output.csv, but all four merge scripts (e1, e3, e4, e5) try to read wp01.data06.output.csv. Similarly, d2_1 writes wp001_data_002_output.csv instead of wp01.data02.output.csv, d3_2 writes wp001_data_009_output.csv, and d5_1 writes wp001_data_008_output.csv. This breaks the entire pipeline since merge steps will fail with FileNotFoundError.

Additional Locations (2)

parsers/d2_1_audit_cover_period.py#L71-L72

parsers/d3_2_audit_opinion.py#L78-L79
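One way to keep parsers and merge scripts from drifting apart is to build every output filename through a single helper. The sketch below assumes the two-digit `wp01.dataNN.output.csv` form that the merge scripts read; `OUTPUT_NAME_TEMPLATE`, `output_filename`, and `output_path` are hypothetical names, and the review does not establish which index each parser should use, so that mapping is left to the author.

```python
import os

# Assumed naming convention, taken from the filenames the merge scripts read.
OUTPUT_NAME_TEMPLATE = "wp01.data{index:02d}.output.csv"


def output_filename(index: int) -> str:
    """Return the dataset filename for a given dataset index."""
    return OUTPUT_NAME_TEMPLATE.format(index=index)


def output_path(output_dir: str, index: int) -> str:
    """Join the output directory with the dataset filename."""
    return os.path.join(output_dir, output_filename(index))
```

A parser would then call `df.to_csv(output_path(OUTPUT_DIR, 6), sep="\t")` and a merge script would read `output_path(OUTPUT_DIR, 6)`, so a naming change happens in exactly one place.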

cursor · 2026-04-03T22:07:30Z

+    result_string = ''
+
+    if ''.join(table.text.split()).find("감사위원회") > 0:
+        result_string = "감사위원회"


Using find() > 0 misses matches at position zero

Medium Severity

find("감사위원회") > 0 fails to detect a match when the search term appears at the very start of the text (position 0), since 0 > 0 is False. The correct check is find(...) != -1, or more simply the in operator. The same pattern appears in d5_1_total_asset.py and d5_3_inv_rec.py for the "단위", ":원", "부채", and "자산" lookups, potentially causing incorrect unit detection or missed balance sheet tables.

Additional Locations (1)

parsers/d5_1_total_asset.py#L45-L46
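The difference is easy to reproduce with a string that starts with the search term. In this small sketch, `mentions_audit_committee` is a hypothetical helper name and the sample text is made up for illustration; only the whitespace-stripping step and the search term come from the diff above.

```python
def mentions_audit_committee(table_text: str) -> bool:
    """Return True if the whitespace-stripped text contains the term."""
    compact = "".join(table_text.split())
    return "감사위원회" in compact


# Sample text (made up) where the term sits at position 0.
text = "감사위원회 설치 여부"
compact = "".join(text.split())

buggy = compact.find("감사위원회") > 0    # position 0 is treated as "not found"
fixed = compact.find("감사위원회") != -1  # -1 is the only "not found" value
```

Here `buggy` evaluates to False while `fixed` and `mentions_audit_committee(text)` both evaluate to True. The `in` form is usually preferable because it states the intent directly.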

cursor · 2026-04-03T22:07:30Z

+  end_date: 20220531
+  report_type: "A001"
+  delay_seconds: 300
+  max_results_per_page: 15


Personal filesystem paths committed in config.yaml

Medium Severity

config.yaml contains hardcoded personal paths like C:/Users/ckpys/Desktop/output and E:/workingDirectory. This file appears to be a developer's local configuration that was committed to the repository. It won't work for any other contributor and exposes a username. This file would typically be .gitignored with a config.yaml.example template provided instead.
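Alongside a .gitignored config.yaml and a committed template, machine-specific paths can be kept out of the repository by resolving them at runtime. The following is a sketch only: the environment variable `DART_OUTPUT_DIR` and the function `resolve_output_dir` are assumptions for this example, and the repo-relative `output` default is inferred from the `output/.gitkeep` commit in this PR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_output_dir(
    env: Optional[Mapping[str, str]] = None,
    project_root: Optional[Path] = None,
) -> Path:
    """Pick the output directory without hardcoding a personal path.

    Order of precedence: the DART_OUTPUT_DIR environment variable
    (hypothetical name), then a repo-relative "output" folder.
    """
    env = os.environ if env is None else env
    root = Path.cwd() if project_root is None else project_root
    override = env.get("DART_OUTPUT_DIR")
    if override:
        return Path(override).expanduser()
    return root / "output"
```

With this shape, a contributor who wants output on another drive sets one environment variable instead of editing a tracked file.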

cursor · 2026-04-03T22:07:30Z

+
+# 감사시간 합계 100시간 미만 제거
+
+df = df[df["감사_합계"] >= 100]


Duplicated data loading logic across all merge scripts

Low Severity

The identical block of code for loading wp01.data06.output.csv, renaming columns, filtering by December year-end, removing financial industry rows via industry.xlsx, and filtering by audit hours ≥ 100 is copy-pasted verbatim across e1_merge_period.py, e3_merge_reportdate.py, e4_merge_financials.py, and e5_audit_committee.py. Extracting this into a shared utility function would reduce maintenance burden and risk of inconsistent changes.

Additional Locations (2)

merge/e3_merge_reportdate.py#L13-L42

merge/e5_audit_committee.py#L13-L51
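A shared helper for the repeated filters might look like the sketch below. Only the `감사_합계` column and the 100-hour threshold come from the diff; the function name `apply_common_filters` and the column names `결산월` (fiscal year-end month) and `업종코드` (industry code) are placeholders, since the actual column names and the layout of industry.xlsx are not shown in this review.

```python
import pandas as pd


def apply_common_filters(
    df: pd.DataFrame,
    excluded_codes,
    *,
    month_col: str = "결산월",       # hypothetical column name
    industry_col: str = "업종코드",  # hypothetical column name
    hours_col: str = "감사_합계",
    min_hours: float = 100,
) -> pd.DataFrame:
    """Apply the year-end, industry, and audit-hour filters in one place."""
    out = df.copy()
    # Cast to numeric first; unparseable values become NaN and are dropped below.
    out[hours_col] = pd.to_numeric(out[hours_col], errors="coerce")
    out = out[out[month_col] == 12]                     # December year-end only
    out = out[~out[industry_col].isin(excluded_codes)]  # drop excluded industries
    out = out[out[hours_col] >= min_hours]              # minimum total audit hours
    return out.reset_index(drop=True)
```

Each merge script could then load its inputs and call this function once, so a change to a threshold or filter order is made in a single module rather than four.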

ypspy and others added 23 commits April 1, 2026 15:55

Add refactoring design spec for DART scraper

336e194

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add implementation plan for DART scraper refactoring

e57b3cd

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: add .gitignore with .worktrees/ exclusion

b5009b6

chore: scaffold directory structure and add config/requirements

0f013c4

chore: fix output dir structure and add tests/.gitkeep

84c6e9b

chore: add output/.gitkeep to track empty output directory

9e62c83

feat: add parsers/common.py with shared utility functions (TDD)

679ebdb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: code quality improvements in common.py and test_common.py

d6d68ca

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: migrate Dart_Scraper.py to scraper/dart_scraper.py and fix app.py recursion

cdb0753

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: add SUB_DOC_NO_SLICE constant and restore delay_seconds to 300

56e6b11

feat: migrate D-1 businessReportCover to parsers/d1_business_report_cover.py

efc5c02

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: remove unused imports and dead functions from d1_business_report_cover.py

8d02c05

feat: migrate D-2 and D-3 parsers to parsers/ package

7edd28e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: remove unused pandas imports and fix file handle collision in d3_1

699759c

feat: migrate D-4, D-5, D-6 parsers to parsers/ package

b73ee0a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: migrate D-7-1 extractGovernance to parsers/d7_1_governance.py

8c663de

feat: migrate E-series merge scripts to merge/ package

2795106

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: cast 감사_합계 to numeric before comparison in merge scripts

2fdf2c0

chore: move legacy D/E-series files to archive and remove dev artifacts

05ce261

fix: remove unused pandas import, fix SyntaxWarning escape sequences

3e187cd

docs: remove unavailable data download section from README

0d3d4fd

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ypspy wants to merge 23 commits into master from feature/refactoring
