
Commit bb8d5a9

refactor: parser tests to use internal package and enhance functionality
- Updated parser tests to import from the internal package instead of the main parser package.
- Added new test cases for paragraph parsing to handle line breaks.
- Improved the tokenizer to support multi-character tokens and handle consecutive spaces.
- Introduced a renderer package with Markdown, HTML, and String renderers for converting AST back to various output formats.
- Implemented a factory pattern for creating renderers and adapted existing renderers to a common interface.
- Removed the restore package as its functionality is now integrated into the renderers.
- Enhanced HTML and Markdown renderers to support new features and improved output accuracy.
1 parent c9fa41c commit bb8d5a9

93 files changed

Lines changed: 4075 additions & 649 deletions


ARCHITECTURE.md

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
# gomark Architecture

This document explains the architectural decisions and design philosophy behind gomark.

## Design Philosophy

gomark is built on the principle of **pragmatic simplicity**:

> "Solve real problems efficiently without over-engineering"

### Core Principles

1. **Simplicity over Complexity**: Choose the simplest solution that works
2. **Performance over Features**: Fast, reliable parsing over theoretical completeness
3. **Maintainability over Flexibility**: Code that's easy to understand and modify
4. **Real Needs over Theoretical Needs**: Implement what's actually used
5. **Direct Solutions**: Avoid layers of abstraction when direct approaches work

## Architectural Decisions

### 1. Token-Based Parsing ✅

**Decision**: Use single-pass tokenization followed by token-based parsing

**Rationale**:
- **Performance**: Single-pass tokenization is very fast
- **Simplicity**: Tokens are easy to work with and debug
- **Reusability**: Tokens can be reused by multiple parsers
- **Memory Efficiency**: Tokens reference original string data

**Alternative Considered**: Text-based parsing (like goldmark)
**Why Rejected**: Added complexity without clear benefits for our use cases
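
The single-pass idea can be illustrated with a minimal sketch. The `Token`, `TokenType`, and `Tokenize` names and the token set below are assumptions made for this example, not gomark's actual tokenizer. Each token's `Value` is a substring of the input, so in Go it shares the input's backing memory rather than copying it.

```go
package main

import "fmt"

// TokenType and Token are hypothetical names for this sketch.
type TokenType string

const (
	TokenText      TokenType = "TEXT"
	TokenSpace     TokenType = "SPACE"
	TokenNewline   TokenType = "NEWLINE"
	TokenAsterisk  TokenType = "ASTERISK"
	TokenPoundSign TokenType = "POUND_SIGN"
)

type Token struct {
	Type  TokenType
	Value string // substring of the input; shares its backing memory
}

// Tokenize walks the input exactly once. Each recognized symbol becomes a
// one-byte token; runs of any other bytes are grouped into one text token.
func Tokenize(input string) []Token {
	tokens := make([]Token, 0, len(input)/2+1)
	start := -1 // start index of the pending text run, or -1 if none
	flush := func(end int) {
		if start >= 0 {
			tokens = append(tokens, Token{Type: TokenText, Value: input[start:end]})
			start = -1
		}
	}
	for i := 0; i < len(input); i++ {
		var t TokenType
		switch input[i] {
		case ' ':
			t = TokenSpace
		case '\n':
			t = TokenNewline
		case '*':
			t = TokenAsterisk
		case '#':
			t = TokenPoundSign
		default:
			if start < 0 {
				start = i
			}
			continue
		}
		flush(i)
		tokens = append(tokens, Token{Type: t, Value: input[i : i+1]})
	}
	flush(len(input))
	return tokens
}

func main() {
	for _, tok := range Tokenize("# Hi *there*") {
		fmt.Printf("%s %q\n", tok.Type, tok.Value)
	}
}
```

Because the symbols of interest are all ASCII, iterating by byte is safe here: multi-byte UTF-8 sequences simply fall into the text run.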

### 2. Simple AST Interface ✅

**Decision**: Use a minimal `Node` interface with direct field access

```go
type Node interface {
	// Type reports the kind of node, for example ParagraphNode.
	Type() NodeType
	// Restore rebuilds the node's source text.
	Restore() string
}
```

**Rationale**:
- **Performance**: Direct field access on concrete node types (for example `node.Children`) is faster than method calls
- **Simplicity**: Easy to understand and work with
- **Focused**: Only implements what's actually needed
- **Memory Efficient**: No overhead for unused tree navigation features

**Alternative Considered**: Complex tree interface (like goldmark)
**Why Rejected**: Analysis showed no actual usage of tree navigation in our codebase
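
As a sketch of how concrete nodes might satisfy this interface, the example below defines two illustrative node types. `Text`, `Paragraph`, and `CountText` are assumptions for this example and may not match gomark's real `ast` package. Note that fields such as `Children` are reachable only on the concrete type, so callers use a type switch to get at them.

```go
package main

import (
	"fmt"
	"strings"
)

type NodeType string

const (
	ParagraphNode NodeType = "PARAGRAPH"
	TextNode      NodeType = "TEXT"
)

type Node interface {
	Type() NodeType
	Restore() string
}

// Text is an illustrative leaf node.
type Text struct{ Content string }

func (*Text) Type() NodeType    { return TextNode }
func (t *Text) Restore() string { return t.Content }

// Paragraph is an illustrative container node with an exported field.
type Paragraph struct{ Children []Node }

func (*Paragraph) Type() NodeType { return ParagraphNode }

func (p *Paragraph) Restore() string {
	var sb strings.Builder
	for _, c := range p.Children {
		sb.WriteString(c.Restore())
	}
	return sb.String()
}

// CountText counts text leaves by reading Children directly after a type
// switch, with no tree-navigation methods on the interface.
func CountText(n Node) int {
	switch v := n.(type) {
	case *Text:
		return 1
	case *Paragraph:
		total := 0
		for _, c := range v.Children {
			total += CountText(c)
		}
		return total
	}
	return 0
}

func main() {
	p := &Paragraph{Children: []Node{&Text{Content: "Hello, "}, &Text{Content: "world"}}}
	fmt.Println(p.Type(), p.Restore(), CountText(p))
}
```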

### 3. Stateless Parsers ✅

**Decision**: Each parser is independent and stateless

**Rationale**:
- **Simplicity**: No complex context management
- **Debuggability**: Easy to test individual parsers
- **Performance**: No context overhead
- **Maintainability**: Clear separation of concerns

**Alternative Considered**: Context-heavy parsing
**Why Rejected**: Added complexity without clear benefits
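
A stateless parser can be sketched as a type with no fields whose result depends only on the tokens it is given. The `Parser` interface, `BoldParser`, and the simplified `Token` and `Node` structs below are hypothetical and exist only to illustrate the idea; gomark's real parser signatures may differ.

```go
package main

import "fmt"

// Token and Node are simplified stand-ins for this sketch.
type Token struct {
	Type  string
	Value string
}

type Node struct {
	Type    string
	Content string
}

// Parser is a hypothetical stateless parser: Match inspects the start of
// tokens and returns the node plus the number of tokens it consumed, or
// (nil, 0) when the tokens do not match.
type Parser interface {
	Match(tokens []Token) (*Node, int)
}

// BoldParser has no fields, so one value can be shared and tested freely.
type BoldParser struct{}

// Match recognizes "**content**" on a single line.
func (BoldParser) Match(tokens []Token) (*Node, int) {
	if len(tokens) < 5 || tokens[0].Type != "ASTERISK" || tokens[1].Type != "ASTERISK" {
		return nil, 0
	}
	content := ""
	for i := 2; i+1 < len(tokens); i++ {
		if tokens[i].Type == "NEWLINE" {
			return nil, 0
		}
		if tokens[i].Type == "ASTERISK" && tokens[i+1].Type == "ASTERISK" {
			if content == "" {
				return nil, 0 // "****" is not bold text
			}
			return &Node{Type: "BOLD", Content: content}, i + 2
		}
		content += tokens[i].Value
	}
	return nil, 0
}

func main() {
	tokens := []Token{
		{"ASTERISK", "*"}, {"ASTERISK", "*"}, {"TEXT", "hi"},
		{"ASTERISK", "*"}, {"ASTERISK", "*"},
	}
	var p Parser = BoldParser{}
	if node, n := p.Match(tokens); node != nil {
		fmt.Println(node.Type, node.Content, n)
	}
}
```

Since nothing is remembered between calls, a test only needs to build a token slice and check the returned node and consumed count.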

### 4. String-Based Node Types ✅

**Decision**: Use `NodeType string` constants

```go
type NodeType string
const ParagraphNode NodeType = "PARAGRAPH"
```

**Rationale**:
- **Debuggability**: Easy to inspect and debug
- **Simplicity**: No complex type hierarchies
- **Extensibility**: Easy to add new types
- **JSON-Friendly**: Serializes naturally

**Alternative Considered**: Interface-based type system
**Why Rejected**: Unnecessary complexity for our needs
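
To illustrate the JSON-friendly point, the sketch below marshals a small tree with the standard `encoding/json` package. The `jsonNode` struct and `Encode` helper are hypothetical names for this example; only `NodeType` and `ParagraphNode` come from the snippet above. A string-backed type needs no custom marshaler, because it is written out as a plain JSON string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type NodeType string

const (
	ParagraphNode NodeType = "PARAGRAPH"
	TextNode      NodeType = "TEXT" // assumed for this sketch
)

// jsonNode is a hypothetical serializable view of an AST node.
type jsonNode struct {
	Type     NodeType   `json:"type"`
	Content  string     `json:"content,omitempty"`
	Children []jsonNode `json:"children,omitempty"`
}

// Encode marshals a node tree; NodeType values appear as readable strings.
func Encode(n jsonNode) (string, error) {
	b, err := json.Marshal(n)
	return string(b), err
}

func main() {
	doc := jsonNode{
		Type:     ParagraphNode,
		Children: []jsonNode{{Type: TextNode, Content: "hi"}},
	}
	out, err := Encode(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```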

### 5. Configuration-Based Extensions ✅

**Decision**: Use configuration to enable/disable features

**Rationale**:
- **Performance**: Disabled features have zero overhead
- **Flexibility**: Easy to customize for different use cases
- **Maintainability**: Clear feature boundaries
- **User-Friendly**: Simple API for configuration
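
One way such configuration could work is to assemble the parser list once, up front, so that a disabled feature's parser is never tried during parsing. The `Config` fields, `BlockParser` interface, and `BuildBlockParsers` function below are assumptions for illustration and are not taken from gomark's `config` package.

```go
package main

import "fmt"

// Config is a hypothetical set of feature flags.
type Config struct {
	EnableTables    bool
	EnableTaskLists bool
}

// BlockParser is reduced to a name here; a real parser would match tokens.
type BlockParser interface{ Name() string }

type namedParser string

func (p namedParser) Name() string { return string(p) }

// BuildBlockParsers selects parsers once from the configuration. A disabled
// feature is simply absent from the list, so it costs nothing per document.
func BuildBlockParsers(cfg Config) []BlockParser {
	parsers := []BlockParser{namedParser("heading")}
	if cfg.EnableTables {
		parsers = append(parsers, namedParser("table"))
	}
	if cfg.EnableTaskLists {
		parsers = append(parsers, namedParser("task_list"))
	}
	// Paragraph is the fallback, so it always goes last.
	return append(parsers, namedParser("paragraph"))
}

func main() {
	for _, p := range BuildBlockParsers(Config{EnableTables: true}) {
		fmt.Println(p.Name())
	}
}
```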

### 6. Buffer-Based Rendering ✅

**Decision**: Use `bytes.Buffer` for output accumulation

**Rationale**:
- **Performance**: Efficient string building
- **Memory**: Reusable buffers
- **Simplicity**: Standard Go pattern
- **Flexibility**: Easy to extend
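
A minimal sketch of buffer-based rendering follows. The `HTMLRenderer` type, its `Render` method, and the simplified `Node` struct are assumptions for this example rather than gomark's actual renderer API. The renderer appends to one `bytes.Buffer` while walking the tree and resets it between calls so the underlying storage is reused.

```go
package main

import (
	"bytes"
	"fmt"
	"html"
)

// Node is a simplified stand-in for an AST node.
type Node struct {
	Type     string
	Content  string
	Children []*Node
}

// HTMLRenderer accumulates output in a single reusable buffer.
type HTMLRenderer struct {
	buf bytes.Buffer
}

// Render writes all nodes into the buffer and returns a copy as a string.
func (r *HTMLRenderer) Render(nodes []*Node) string {
	r.buf.Reset() // keep the allocated capacity, drop old content
	for _, n := range nodes {
		r.renderNode(n)
	}
	return r.buf.String()
}

func (r *HTMLRenderer) renderNode(n *Node) {
	switch n.Type {
	case "PARAGRAPH":
		r.wrap("p", n.Children)
	case "BOLD":
		r.wrap("strong", n.Children)
	case "TEXT":
		r.buf.WriteString(html.EscapeString(n.Content))
	}
}

// wrap renders children between an opening and closing tag.
func (r *HTMLRenderer) wrap(tag string, children []*Node) {
	r.buf.WriteString("<" + tag + ">")
	for _, c := range children {
		r.renderNode(c)
	}
	r.buf.WriteString("</" + tag + ">")
}

func main() {
	doc := []*Node{{Type: "PARAGRAPH", Children: []*Node{
		{Type: "TEXT", Content: "a < b "},
		{Type: "BOLD", Children: []*Node{{Type: "TEXT", Content: "c"}}},
	}}}
	r := &HTMLRenderer{}
	fmt.Println(r.Render(doc))
}
```

Adding a node kind in this design means adding one `case`, which is the sense in which the approach stays easy to extend.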

## Package Organization

### Public vs Internal

**Public Packages** (goldmark-style):
```
├── ast/             # AST definitions - users need access
├── config/          # Configuration - users need to configure
├── parser/          # Parser interfaces - users may extend
├── renderer/        # Renderer interfaces - users may extend
```

**Internal Implementation**:
```
└── parser/internal/ # Parser implementations - users don't need access
```

**Rationale**:
- Public APIs allow extensibility where it matters
- Internal packages keep implementation details hidden
- Follows goldmark patterns for familiarity

## Performance Optimizations

### 1. Minimal Allocations
- Reuse token slices where possible
- Buffer pooling in renderers
- Direct field access instead of method calls
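
Buffer pooling is commonly done in Go with `sync.Pool`. The sketch below shows that general pattern; the `bufPool` variable and `RenderLines` function are hypothetical, and the source does not say which pooling mechanism gomark's renderers actually use.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so repeated renders avoid allocating
// a fresh buffer each time.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// RenderLines writes each line followed by a newline using a pooled buffer.
func RenderLines(lines []string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold a previous render
	defer bufPool.Put(buf)

	for _, line := range lines {
		buf.WriteString(line)
		buf.WriteByte('\n')
	}
	// String copies the bytes, so returning the buffer to the pool is safe.
	return buf.String()
}

func main() {
	fmt.Print(RenderLines([]string{"first", "second"}))
}
```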

### 2. Single-Pass Processing
- Tokenization is single-pass
- No multiple traversals of input text
- Direct token-to-AST conversion

### 3. Focused Features
- Only implement actually-used functionality
- No complex tree operations unless needed
- Disable unused extensions for zero overhead

## Intentional Limitations

These are **conscious decisions**, not oversights:

### 1. HTML Attributes
**Current**: Basic HTML tags without attributes
**Rationale**: Complex attribute parsing adds significant complexity for minimal benefit

### 2. Multi-Character Tokens
**Current**: Single-character tokenization
**Rationale**: Works for all supported markdown features, simpler implementation

### 3. Complex Tree Navigation
**Current**: Direct field access only
**Rationale**: No actual usage found in codebase analysis

### 4. Parsing Context
**Current**: Stateless parsers
**Rationale**: Sufficient for current feature set, much simpler

## Recent Improvements

### Fixed Blockquote Blank Lines (GitHub Issue #19)
**Problem**: Blank lines in blockquotes weren't rendered correctly
**Solution**: Enhanced `Blockquote.Restore()` to handle `LineBreak` nodes properly
**Result**: Blank lines in blockquotes are now preserved

### Package Refactoring
**Problem**: Everything was in `internal/` packages
**Solution**: Moved key packages to public for extensibility
**Result**: goldmark-style architecture with better extensibility

## Comparison with goldmark

| Aspect | goldmark | gomark |
|--------|----------|--------|
| **Complexity** | High | Low |
| **Performance** | Good | Excellent |
| **Extensibility** | Very High | Moderate |
| **Maintainability** | Moderate | High |
| **Learning Curve** | Steep | Gentle |
| **Feature Set** | Comprehensive | Focused |

## When to Choose gomark

**Choose gomark when**:
- You need fast, reliable markdown parsing
- You want simple, maintainable code
- You're building applications, not markdown libraries
- You need good performance with moderate extensibility

**Choose goldmark when**:
- You need maximum extensibility
- You're building a markdown processing library
- You need complex AST transformations
- You need full CommonMark compliance, including edge cases

## Future Evolution

gomark is designed to evolve pragmatically:

1. **Add features only when needed**: No speculative features
2. **Maintain simplicity**: New features shouldn't complicate existing code
3. **Performance first**: New features shouldn't hurt performance
4. **Backward compatibility**: Changes should be additive

### Potential Future Additions

**Only if there's demonstrated need**:
- AST walking API (if users request it)
- More output formats (if users request them)
- Advanced HTML attributes (if simple approach proves insufficient)
- Text-based parsing (if token-based proves limiting)

## Conclusion

gomark represents a **pragmatic approach** to markdown parsing:

- **Goldmark-inspired architecture** for familiarity and extensibility
- **Performance-focused implementation** for real-world applications
- **Simple, maintainable code** that developers can understand and modify
- **Focused feature set** that solves real problems without over-engineering

This approach delivers excellent performance and maintainability while providing enough extensibility for most real-world use cases.
