The Craft of Building a Handmade HTML Parser
The development log of ZMarkupParser HTML to NSAttributedString rendering engine
Tokenizing an HTML string, normalization, generating an Abstract Syntax Tree, applying the Visitor Pattern / Builder Pattern, and some miscellaneous discussions…
Continuation
Last year, I published an article titled “[ TL;DR ] Implementing iOS NSAttributedString HTML Render”, which briefly introduced how to use XMLParser to parse HTML and then convert it into NSAttributedString.Key. The structure and thought process in the article were quite disorganized, as it was a quick record of the issues encountered previously and I did not spend much time researching the topic.
Convert HTML String to NSAttributedString
Revisiting this topic, we need to be able to convert the HTML string provided by the API into NSAttributedString and apply the corresponding styles to display it in UITextView/UILabel.
e.g. <b>Test<a>Link</a></b>
should be displayed as Test Link
- Note 1 It is not recommended to use HTML as a communication and rendering medium between the App and data, as the HTML specification is too flexible. The App cannot support all HTML styles, and there is no official HTML conversion rendering engine.
- Note 2 Starting from iOS 15, you can use the native AttributedString to parse Markdown or introduce the apple/swift-markdown Swift Package to parse Markdown.
- Note 3 Due to the large scale of our company’s project and the long-term use of HTML as a medium, it is temporarily impossible to fully switch to Markdown or other Markup.
- Note 4 The HTML here is not intended to display an entire HTML webpage; it only uses HTML as a Markdown-like markup for styling strings. (To render a full page or complex HTML including images and tables, you still need to use WebView loadHTML)
It is strongly recommended to use Markdown as the string rendering medium language. If your project has the same dilemma as mine and you have no elegant tool to convert HTML to NSAttributedString, then please use ZMarkupParser.
Friends who remember the previous article can directly jump to the ZhgChgLi / ZMarkupParser section.
NSAttributedString.DocumentType.html
The methods for HTML to NSAttributedString found online all suggest directly using NSAttributedString’s built-in options to render HTML, as shown in the example below:
let htmlString = "<b>Test<a>Link</a></b>"
let data = htmlString.data(using: String.Encoding.utf8)!
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]
let attributedString = try! NSAttributedString(data: data, options: attributedOptions, documentAttributes: nil)
The problem with this approach:
- Poor performance: This method uses WebView Core to render the style, then switches back to the Main Thread for UI display; rendering more than 300 characters takes 0.03 seconds.
- Text loss: For example, marketing copy might use <Congratulation!>, which will be treated as an HTML tag and removed.
- Lack of customization: For example, you cannot specify the boldness level of HTML bold tags in NSAttributedString.
- Intermittent crashes on iOS 12 and later, with no official solution
- Frequent crashes on iOS 15; testing found that it crashes 100% of the time under low battery conditions (fixed in iOS 15.2 and later)
- Long strings cause crashes; testing shows that input longer than 54,600 characters crashes 100% of the time (EXC_BAD_ACCESS)
The most painful issue for us is the crash problem. From the release of iOS 15 to the fix in 15.2, our app was plagued by this issue. From the data, between 2022/03/11 and 2022/06/08, it caused over 2.4K crashes, affecting over 1.4K users.
This crash issue has existed since iOS 12, and iOS 15 just made it worse. I guess the fix in iOS 15.2 is just a patch, and the official solution cannot completely eradicate it.
The second issue is performance. As a string style Markup Language, it is heavily used in the app’s UILabel/UITextView. As mentioned earlier, one label takes 0.03 seconds, and multiplying this by the number of UILabel/UITextView in a list will cause noticeable lag in user interactions.
XMLParser
The second solution is introduced in the previous article, which uses XMLParser to parse into corresponding NSAttributedString keys and apply styles.
Refer to the implementation of SwiftRichString and the content of the previous article.
The previous article only explored using XMLParser to parse HTML and perform corresponding conversions, completing an experimental implementation, but it did not design it as a well-structured and extensible “tool.”
The problem with this approach:
- Zero tolerance for errors: <br>, <Congratulation!>, and <b>Bold<i>Bold+Italic</b>Italic</i> are three possible HTML scenarios that will cause XMLParser to throw an error and display a blank result.
- Using XMLParser, the HTML string must fully comply with XML rules, unlike browsers or NSAttributedString.DocumentType.html, which tolerate such input and still display it correctly.
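The zero-tolerance behavior can be reproduced with a few lines. This is a minimal sketch, not code from the previous article: `isWellFormedXML` is a hypothetical helper, and it only checks whether XMLParser accepts a fragment, without building any NSAttributedString.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML // XMLParser lives in this module on non-Apple platforms
#endif

// Hypothetical helper: true only if XMLParser accepts the fragment as well-formed XML.
func isWellFormedXML(_ fragment: String) -> Bool {
    let parser = XMLParser(data: Data(fragment.utf8))
    return parser.parse() // returns false as soon as a parse error aborts parsing
}

let samples = [
    "<b>Bold</b>",                          // valid XML
    "<p>Line<br>Break</p>",                 // <br> is never closed
    "<b>Bold<i>Bold+Italic</b>Italic</i>",  // misplaced closing tags
]
for sample in samples {
    print(sample, "->", isWellFormedXML(sample) ? "parsed" : "rejected")
}
```

Only the first sample is valid XML; the other two are everyday HTML that a browser would still render.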
Standing on the shoulders of giants
Neither of the above two solutions can perfectly and elegantly solve the HTML problem, so I started searching for existing solutions.
- johnxnguyen / Down Only supports converting Markdown to Any (XML/NSAttributedString…), but does not support converting HTML.
- malcommac / SwiftRichString Uses XMLParser at its core, and testing shows it has the same zero tolerance for errors as mentioned earlier.
- scinfu / SwiftSoup Only supports HTML parsing (Selector); it does not support converting to NSAttributedString.
After searching extensively, I found that the results are similar to the projects mentioned above. There are no giants’ shoulders to stand on.
ZhgChgLi/ZMarkupParser
Without the shoulders of giants, I had to become a giant myself, so I developed an HTML String to NSAttributedString tool.
Developed purely in Swift, it parses HTML Tags using Regex and performs Tokenization, analyzing and correcting Tag accuracy (fixing tags without an end & misplaced tags), then converts it into an abstract syntax tree. Finally, using the Visitor Pattern, it maps HTML Tags to abstract styles to get the final NSAttributedString result; it does not rely on any Parser Lib.
Features
- Supports HTML Render (to NSAttributedString) / Stripper (removing HTML Tags) / Selector functions
- Higher performance than NSAttributedString.DocumentType.html
- Automatically analyzes and corrects Tag accuracy (fixing tags without an end & misplaced tags)
- Supports dynamic style settings from style="color:red..."
- Supports custom style specifications, such as how bold bold should be
- Supports flexible extensibility for tags or custom tags and attributes
For detailed introduction, installation, and usage, refer to this article: ZMarkupParser HTML String to NSAttributedString Tool
You can directly git clone the project, open the ZMarkupParser.xcworkspace project, select the ZMarkupParser-Demo target, and Build & Run to try it out.
Technical Details
Now, let’s dive into the technical details of developing this tool.
Overview of the operation process
The above image shows the general operation process; the rest of this article introduces it step by step with code examples.
⚠️ This article will simplify Demo Code as much as possible, reduce abstraction and performance considerations, and focus on explaining the operation principles; for the final result, please refer to the project Source Code.
Code Implementation — Tokenization
a.k.a parser, parsing
When it comes to HTML rendering, the most important part is parsing. In the past, HTML was parsed as XML using XMLParser; however, it couldn’t handle the fact that HTML usage is not 100% XML, causing parser errors and inability to dynamically correct them.
After ruling out the use of XMLParser, the only option left in Swift was to use Regex for matching and parsing.
Initially, the idea was to use Regex to extract “paired” HTML Tags and recursively find HTML Tags layer by layer until the end; however, this couldn’t solve the problem of nested HTML Tags or support for misplaced tags. Therefore, we changed the strategy to extract “single” HTML Tags, recording whether they are Start Tags, Close Tags, or Self-Closing Tags, and combining other strings into a parsed result array.
Tokenization structure is as follows:
enum HTMLParsedResult {
    case start(StartItem) // <a>
    case close(CloseItem) // </a>
    case selfClosing(SelfClosingItem) // <br/>
    case rawString(NSAttributedString)
}

extension HTMLParsedResult {
    class SelfClosingItem {
        let tagName: String
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?

        init(tagName: String, tagAttributedString: NSAttributedString, attributes: [String: String]?) {
            self.tagName = tagName
            self.tagAttributedString = tagAttributedString
            self.attributes = attributes
        }
    }

    class StartItem {
        let tagName: String
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?

        // A Start Tag may be an abnormal HTML Tag or normal text, e.g. <Congratulation!>.
        // If the subsequent Normalization finds it to be an isolated Start Tag, this is set to true.
        var isIsolated: Bool = false

        init(tagName: String, tagAttributedString: NSAttributedString, attributes: [String: String]?) {
            self.tagName = tagName
            self.tagAttributedString = tagAttributedString
            self.attributes = attributes
        }

        // Used for automatic padding correction in subsequent Normalization
        func convertToCloseParsedItem() -> CloseItem {
            return CloseItem(tagName: self.tagName)
        }

        // Used for automatic padding correction in subsequent Normalization
        func convertToSelfClosingParsedItem() -> SelfClosingItem {
            return SelfClosingItem(tagName: self.tagName, tagAttributedString: self.tagAttributedString, attributes: self.attributes)
        }
    }

    class CloseItem {
        let tagName: String

        init(tagName: String) {
            self.tagName = tagName
        }
    }
}
The regex used is as follows:
<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)
- closeTag: matches the / in </a>
- tagName: matches the a in <a> or </a>
- tagAttributes: matches href="https://zhgchg.li" style="color:red" in <a href="https://zhgchg.li" style="color:red">
- selfClosingTag: matches the / in <br/>
*This regex can still be optimized, will do it later.
Additional information about regex is provided in the latter part of the article, interested friends can refer to it.
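To see what each named group captures, the pattern can be run on its own. The following is a minimal sketch rather than the project's code: `TagMatch` and `findTags` are hypothetical names, and attributes are kept as one raw string instead of being parsed further.

```swift
import Foundation

// The article's pattern as a Swift raw string, so backslashes and quotes need no escaping.
let tagPattern = #"<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)"#

// Hypothetical value type used only in this sketch.
struct TagMatch: Equatable {
    let name: String
    let attributes: String
    let isClose: Bool
    let isSelfClosing: Bool
}

func findTags(in html: String) -> [TagMatch] {
    guard let regex = try? NSRegularExpression(pattern: tagPattern, options: []) else { return [] }
    let text = NSString(string: html)
    let fullRange = NSRange(location: 0, length: text.length) // UTF-16 length
    return regex.matches(in: html, options: [], range: fullRange).map { match -> TagMatch in
        // An optional group that did not take part in the match reports NSNotFound.
        func group(_ name: String) -> String {
            let range = match.range(withName: name)
            return range.location == NSNotFound ? "" : text.substring(with: range)
        }
        return TagMatch(
            name: group("tagName").lowercased(),
            attributes: group("tagAttributes").trimmingCharacters(in: .whitespaces),
            isClose: group("closeTag") == "/",
            isSelfClosing: group("selfClosingTag") == "/"
        )
    }
}

let foundTags = findTags(in: #"<a href="https://zhgchg.li">Link</a><br/>"#)
for tag in foundTags {
    print(tag)
}
```

The text between tags ("Link") is not matched at all; the tokenizer recovers it from the gaps between consecutive matches.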
Combining it all together:
var tokenizationResult: [HTMLParsedResult] = []
// `pattern` is the regex shown above; `expressionOptions` is defined in the source code
let expression = try! NSRegularExpression(pattern: pattern, options: expressionOptions)
let attributedString = NSAttributedString(string: "<a href=\"https://zhgchg.li\">Li<b>nk</a>Bold</b>")
let totalLength = attributedString.string.utf16.count // UTF-16 length, so emoji are counted correctly
var lastMatch: NSTextCheckingResult?

// Start Tags Stack, First In Last Out (FILO)
// Used to check whether the HTML string needs subsequent normalization to correct misalignment or add self-closing tags
var stackStartItems: [HTMLParsedResult.StartItem] = []
var needFormatter: Bool = false

// Helper for this demo: returns nil when an optional capture group did not take part in the match
func attributedSubstring(of range: NSRange) -> NSAttributedString? {
    guard range.location != NSNotFound else { return nil }
    return attributedString.attributedSubstring(from: range)
}

expression.enumerateMatches(in: attributedString.string, range: NSMakeRange(0, totalLength)) { match, _, _ in
    if let match = match {
        // Check the string between tags or before the first tag
        // e.g. Test<a>Link</a>zzz<b>bold</b>Test2 -> Test, zzz
        let lastMatchEnd = lastMatch?.range.upperBound ?? 0
        let currentMatchStart = match.range.lowerBound
        if currentMatchStart > lastMatchEnd {
            let rawStringBetweenTag = attributedString.attributedSubstring(from: NSMakeRange(lastMatchEnd, (currentMatchStart - lastMatchEnd)))
            tokenizationResult.append(.rawString(rawStringBetweenTag))
        }

        // <a href="https://zhgchg.li">, </a>
        let matchAttributedString = attributedSubstring(of: match.range)
        // a, a
        let matchTag = attributedSubstring(of: match.range(withName: "tagName"))?.string.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
        // false, true
        let matchIsEndTag = attributedSubstring(of: match.range(withName: "closeTag"))?.string.trimmingCharacters(in: .whitespacesAndNewlines) == "/"
        // href="https://zhgchg.li", nil
        // parseAttributes uses regex to further extract HTML attributes into [String: String]; refer to the source code
        let matchTagAttributes = parseAttributes(attributedSubstring(of: match.range(withName: "tagAttributes")))
        // false, false
        let matchIsSelfClosingTag = attributedSubstring(of: match.range(withName: "selfClosingTag"))?.string.trimmingCharacters(in: .whitespacesAndNewlines) == "/"

        if let matchAttributedString = matchAttributedString,
           let matchTag = matchTag {
            if matchIsSelfClosingTag {
                // e.g. <br/>
                tokenizationResult.append(.selfClosing(.init(tagName: matchTag, tagAttributedString: matchAttributedString, attributes: matchTagAttributes)))
            } else {
                // e.g. <a> or </a>
                if matchIsEndTag {
                    // e.g. </a>
                    // Find the position of the same tag name in the stack, searching from the last element
                    if let index = stackStartItems.lastIndex(where: { $0.tagName == matchTag }) {
                        // If it is not the last one, there is a misalignment or a missing closing tag
                        if index != stackStartItems.count - 1 {
                            needFormatter = true
                        }
                        tokenizationResult.append(.close(.init(tagName: matchTag)))
                        stackStartItems.remove(at: index)
                    } else {
                        // Extra close tag, e.g. </a>
                        // Does not affect subsequent processing, so just ignore it
                    }
                } else {
                    // e.g. <a>
                    let startItem: HTMLParsedResult.StartItem = HTMLParsedResult.StartItem(tagName: matchTag, tagAttributedString: matchAttributedString, attributes: matchTagAttributes)
                    tokenizationResult.append(.start(startItem))
                    // Add to stack
                    stackStartItems.append(startItem)
                }
            }
        }
        lastMatch = match
    }
}

// Check the trailing raw string
// e.g. Test<a>Link</a>Test2 -> Test2
if let lastMatch = lastMatch {
    let currentIndex = lastMatch.range.upperBound
    if totalLength > currentIndex {
        // There is remaining text after the last tag
        let resetString = attributedString.attributedSubstring(from: NSMakeRange(currentIndex, (totalLength - currentIndex)))
        tokenizationResult.append(.rawString(resetString))
    }
} else {
    // lastMatch = nil, meaning no tags were found and everything is plain text
    let resetString = attributedString.attributedSubstring(from: NSMakeRange(0, totalLength))
    tokenizationResult.append(.rawString(resetString))
}

// If the stack is not empty, there are start tags without corresponding end tags
// Mark them as isolated start tags
for stackStartItem in stackStartItems {
    stackStartItem.isIsolated = true
    needFormatter = true
}

print(tokenizationResult)
// [
//    .start("a",["href":"https://zhgchg.li"])
//    .rawString("Li")
//    .start("b",nil)
//    .rawString("nk")
//    .close("a")
//    .rawString("Bold")
//    .close("b")
// ]
Operation flow as shown in the figure
The final result will be an array of Tokenization results.
Corresponding source code in HTMLStringToParsedResultProcessor.swift implementation
Normalization
a.k.a Formatter, normalization
After obtaining the preliminary parsing results in the previous step, if it is found during parsing that further normalization is needed, this step is required to automatically correct HTML Tag issues.
There are three types of HTML Tag issues:
- HTML Tag but missing Close Tag: e.g., <br>
- General text mistaken as HTML Tag: e.g., <Congratulation!>
- HTML Tag misalignment issues: e.g., <a>Li<b>nk</a>Bold</b>
The correction method is also very simple. We need to traverse the elements of the Tokenization results and try to fill in the gaps.
Operation flow as shown in the figure
var normalizationResult = tokenizationResult

// Start Tags Stack, First In Last Out (FILO)
var stackExpectedStartItems: [HTMLParsedResult.StartItem] = []
var itemIndex = 0
while itemIndex < normalizationResult.count {
    switch normalizationResult[itemIndex] {
    case .start(let item):
        if item.isIsolated {
            // It is an isolated Start Tag
            if WC3HTMLTagName(rawValue: item.tagName) == nil && (item.attributes?.isEmpty ?? true) {
                // Not a W3C-defined HTML Tag and has no HTML attributes
                // (the WC3HTMLTagName enum can be found in the source code)
                // Treat it as general text mistaken for an HTML Tag and change it to a raw string
                normalizationResult[itemIndex] = .rawString(item.tagAttributedString)
            } else {
                // Otherwise, change it to a self-closing tag, e.g., <br> -> <br/>
                normalizationResult[itemIndex] = .selfClosing(item.convertToSelfClosingParsedItem())
            }
            itemIndex += 1
        } else {
            // Normal Start Tag, add to Stack
            stackExpectedStartItems.append(item)
            itemIndex += 1
        }
    case .close(let item):
        // Encountered a Close Tag
        // Get the tags between the matching Start Tag in the stack and this Close Tag
        // e.g., <a><u><b>[CurrentIndex]</b></u></a> -> interval 0
        // e.g., <a><u><b>[CurrentIndex]</a></u></b> -> interval b,u
        let reversedStackExpectedStartItems = Array(stackExpectedStartItems.reversed())
        guard let reversedStackExpectedStartItemsOccurredIndex = reversedStackExpectedStartItems.firstIndex(where: { $0.tagName == item.tagName }) else {
            itemIndex += 1
            continue
        }
        let reversedStackExpectedStartItemsOccurred = Array(reversedStackExpectedStartItems.prefix(upTo: reversedStackExpectedStartItemsOccurredIndex))

        // Interval 0 means there is no tag misalignment
        guard reversedStackExpectedStartItemsOccurred.count != 0 else {
            // It is a pair, pop
            stackExpectedStartItems.removeLast()
            itemIndex += 1
            continue
        }

        // There are tags in the interval: automatically close them before this Close Tag and reopen them after it
        // e.g., <a><u><b>[CurrentIndex]</a></u></b> ->
        // e.g., <a><u><b>[CurrentIndex]</b></u></a><u><b></u></b>
        let stackExpectedStartItemsOccurred = Array(reversedStackExpectedStartItemsOccurred.reversed())
        let afterItems = stackExpectedStartItemsOccurred.map({ HTMLParsedResult.start($0) })
        let beforeItems = reversedStackExpectedStartItemsOccurred.map({ HTMLParsedResult.close($0.convertToCloseParsedItem()) })
        normalizationResult.insert(contentsOf: afterItems, at: normalizationResult.index(after: itemIndex))
        normalizationResult.insert(contentsOf: beforeItems, at: itemIndex)
        itemIndex = normalizationResult.index(after: itemIndex) + stackExpectedStartItemsOccurred.count

        // Update the Start Tags Stack: remove the matched tag and the interval tags
        // (the reopened b,u are pushed again when the loop reaches them)
        stackExpectedStartItems.removeAll { startItem in
            return reversedStackExpectedStartItems.prefix(through: reversedStackExpectedStartItemsOccurredIndex).contains(where: { $0 === startItem })
        }
    case .selfClosing, .rawString:
        itemIndex += 1
    }
}

print(normalizationResult)
// [
//    .start("a",["href":"https://zhgchg.li"])
//    .rawString("Li")
//    .start("b",nil)
//    .rawString("nk")
//    .close("b")
//    .close("a")
//    .start("b",nil)
//    .rawString("Bold")
//    .close("b")
// ]
Corresponding implementation in the source code HTMLParsedResultFormatterProcessor.swift
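The misalignment fix is easier to follow on a reduced model. The sketch below is a simplification, not the project's implementation: `Token` and `normalize` are hypothetical names, tokens carry only a tag name or text, and it assumes isolated start tags have already been converted, so every remaining start tag has a close tag somewhere.

```swift
// Reduced token model for illustration only.
enum Token: Equatable {
    case start(String)
    case close(String)
    case text(String)
}

func normalize(_ tokens: [Token]) -> [Token] {
    var output: [Token] = []
    var openTags: [String] = [] // stack of currently open tag names
    for token in tokens {
        switch token {
        case .start(let name):
            openTags.append(name)
            output.append(token)
        case .text:
            output.append(token)
        case .close(let name):
            // A close tag with no matching start tag is dropped.
            guard let matchIndex = openTags.lastIndex(of: name) else { continue }
            // Tags opened after the matching start are still open: close them first.
            let crossed = Array(openTags[(matchIndex + 1)...])
            for tag in crossed.reversed() { output.append(.close(tag)) }
            output.append(.close(name))
            // Reopen the crossed tags so the following text keeps their style.
            for tag in crossed { output.append(.start(tag)) }
            openTags.removeSubrange(matchIndex...)
            openTags.append(contentsOf: crossed)
        }
    }
    return output
}

// <a>Li<b>nk</a>Bold</b>
let misplaced: [Token] = [
    .start("a"), .text("Li"), .start("b"), .text("nk"),
    .close("a"), .text("Bold"), .close("b"),
]
let repaired = normalize(misplaced)
print(repaired)
```

For this input the result has the same shape as the normalization output above: b is closed before a, then reopened for "Bold".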
Abstract Syntax Tree
a.k.a AST, Abstract Tree
After the Tokenization & Normalization data preprocessing is completed, the result needs to be converted into an abstract tree 🌲.
As shown in the figure
Converting into an abstract tree facilitates our future operations and extensions, such as implementing Selector functionality or other conversions like HTML to Markdown; or if we want to add Markdown to NSAttributedString in the future, we only need to implement Markdown’s Tokenization & Normalization to complete it.
First, we define a Markup Protocol with Child & Parent properties to record the information of leaves and branches:
protocol Markup: AnyObject {
    var parentMarkup: Markup? { get set }
    var childMarkups: [Markup] { get set }

    func appendChild(markup: Markup)
    func prependChild(markup: Markup)
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result
}

extension Markup {
    func appendChild(markup: Markup) {
        markup.parentMarkup = self
        childMarkups.append(markup)
    }

    func prependChild(markup: Markup) {
        markup.parentMarkup = self
        childMarkups.insert(markup, at: 0)
    }
}
Additionally, using the Visitor Pattern, each style attribute is defined as an object Element, and different Visit strategies are used to obtain individual application results.
protocol MarkupVisitor {
    associatedtype Result

    func visit(markup: Markup) -> Result

    func visit(_ markup: RootMarkup) -> Result
    func visit(_ markup: RawStringMarkup) -> Result

    func visit(_ markup: BoldMarkup) -> Result
    func visit(_ markup: LinkMarkup) -> Result
    //...
}

extension MarkupVisitor {
    func visit(markup: Markup) -> Result {
        return markup.accept(self)
    }
}
Basic Markup nodes:
// Root node
final class RootMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

// Leaf node
final class RawStringMarkup: Markup {
    let attributedString: NSAttributedString

    init(attributedString: NSAttributedString) {
        self.attributedString = attributedString
    }

    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}
Define Markup Style Nodes:
// Branch nodes:

// Link style
final class LinkMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

// Bold style
final class BoldMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}
Corresponding implementation in the source code Markup
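As an illustration of how a visitor walks this tree, here is a minimal self-contained sketch of a "stripper"-style visitor that flattens a tree to plain text. It redeclares trimmed-down versions of the types above so it compiles on its own; `PlainTextVisitor` is a hypothetical name, and RawStringMarkup holds a plain String here instead of an NSAttributedString.

```swift
protocol MarkupVisitor {
    associatedtype Result
    func visit(_ markup: RootMarkup) -> Result
    func visit(_ markup: RawStringMarkup) -> Result
    func visit(_ markup: BoldMarkup) -> Result
}

protocol Markup: AnyObject {
    var parentMarkup: Markup? { get set }
    var childMarkups: [Markup] { get set }
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result
}

extension Markup {
    func appendChild(markup: Markup) {
        markup.parentMarkup = self
        childMarkups.append(markup)
    }
}

final class RootMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class BoldMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class RawStringMarkup: Markup {
    let string: String // simplified: the article stores an NSAttributedString
    init(string: String) { self.string = string }
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

// Hypothetical visitor: style nodes contribute nothing, leaves contribute their text.
struct PlainTextVisitor: MarkupVisitor {
    typealias Result = String
    private func joinChildren(of markup: Markup) -> String {
        markup.childMarkups.map { $0.accept(self) }.joined()
    }
    func visit(_ markup: RootMarkup) -> String { joinChildren(of: markup) }
    func visit(_ markup: BoldMarkup) -> String { joinChildren(of: markup) }
    func visit(_ markup: RawStringMarkup) -> String { markup.string }
}

// Tree for <b>Test</b>Link
let demoRoot = RootMarkup()
let demoBold = BoldMarkup()
demoRoot.appendChild(markup: demoBold)
demoBold.appendChild(markup: RawStringMarkup(string: "Test"))
demoRoot.appendChild(markup: RawStringMarkup(string: "Link"))
let plainText = demoRoot.accept(PlainTextVisitor())
print(plainText)
```

A rendering visitor would follow the same shape, with a different Result type and per-node style handling in place of the plain concatenation.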
Before converting to an abstract tree, we also need…
MarkupComponent
Our tree structure does not depend on any particular data structure, so the nodes themselves carry no data (for example, a LinkMarkup node needs URL information to perform subsequent rendering). For this, we define a container that stores tree nodes together with their related data:
protocol MarkupComponent {
    associatedtype T
    var markup: Markup { get }
    var value: T { get }

    init(markup: Markup, value: T)
}

extension Sequence where Iterator.Element: MarkupComponent {
    func value(markup: Markup) -> Element.T? {
        return self.first(where:{ $0.markup === markup })?.value as? Element.T
    }
}
Corresponding implementation in the source code MarkupComponent
You can also declare Markup as Hashable and directly use a Dictionary to store values ([Markup: Any]), but in this way, Markup cannot be used as a general type and must be written as any Markup.
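The lookup relies on object identity rather than equality. A minimal sketch under simplified types: `HrefComponent` is a hypothetical component, and Markup is reduced to an empty class-bound protocol.

```swift
protocol Markup: AnyObject {}
final class LinkMarkup: Markup {}

protocol MarkupComponent {
    associatedtype T
    var markup: Markup { get }
    var value: T { get }
}

extension Sequence where Element: MarkupComponent {
    // Identity (===) lookup: two LinkMarkup instances are distinct nodes
    // even though they have the same type and no stored data.
    func value(markup: Markup) -> Element.T? {
        first(where: { $0.markup === markup })?.value
    }
}

// Hypothetical component pairing a node with the href it was parsed from.
struct HrefComponent: MarkupComponent {
    let markup: Markup
    let value: String
}

let knownLink = LinkMarkup()
let otherLink = LinkMarkup()
let hrefComponents = [HrefComponent(markup: knownLink, value: "https://zhgchg.li")]

print(hrefComponents.value(markup: knownLink) ?? "none")
print(hrefComponents.value(markup: otherLink) ?? "none")
```

The linear search is fine for short strings; the Dictionary alternative trades the any Markup spelling for constant-time lookup.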
HTMLTag & HTMLTagName & HTMLTagNameVisitor
We also abstracted the HTML Tag Name part, allowing users to decide which tags need to be processed and facilitating future extensions. For example, the <strong> Tag Name can correspond to BoldMarkup.
public protocol HTMLTagName {
    var string: String { get }
    func accept<V: HTMLTagNameVisitor>(_ visitor: V) -> V.Result
}

public struct A_HTMLTagName: HTMLTagName {
    public let string: String = WC3HTMLTagName.a.rawValue

    public init() {
    }

    public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
        return visitor.visit(self)
    }
}

public struct B_HTMLTagName: HTMLTagName {
    public let string: String = WC3HTMLTagName.b.rawValue

    public init() {
    }

    public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
        return visitor.visit(self)
    }
}
Corresponding implementation in the source code HTMLTagNameVisitor
Additionally, refer to the W3C wiki which lists the HTML tag name enum: WC3HTMLTagName.swift
HTMLTag is simply a container object because we want to allow external specification of the style corresponding to the HTML Tag, so we declare a container to put them together:
struct HTMLTag {
    let tagName: HTMLTagName
    let customStyle: MarkupStyle? // Render will be explained later

    init(tagName: HTMLTagName, customStyle: MarkupStyle? = nil) {
        self.tagName = tagName
        self.customStyle = customStyle
    }
}
Corresponding implementation in the source code HTMLTag
HTMLTagNameToHTMLMarkupVisitor
struct HTMLTagNameToMarkupVisitor: HTMLTagNameVisitor {
    typealias Result = Markup

    let attributes: [String: String]?

    func visit(_ tagName: A_HTMLTagName) -> Result {
        return LinkMarkup()
    }

    func visit(_ tagName: B_HTMLTagName) -> Result {
        return BoldMarkup()
    }
    //...
}
Corresponding implementation in the source code HTMLTagNameToHTMLMarkupVisitor
Convert to Abstract Tree with HTML Data
We need to convert the result of the normalized HTML data into an abstract tree. First, declare a MarkupComponent data structure that can store HTML data:
struct HTMLElementMarkupComponent: MarkupComponent {
    struct HTMLElement {
        let tag: HTMLTag
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?
    }

    typealias T = HTMLElement

    let markup: Markup
    let value: HTMLElement

    init(markup: Markup, value: HTMLElement) {
        self.markup = markup
        self.value = value
    }
}
Convert to Markup Abstract Tree:
var htmlElementComponents: [HTMLElementMarkupComponent] = []
let rootMarkup = RootMarkup()
var currentMarkup: Markup = rootMarkup

let htmlTags: [String: HTMLTag]
init(htmlTags: [HTMLTag]) {
    self.htmlTags = Dictionary(uniqueKeysWithValues: htmlTags.map{ ($0.tagName.string, $0) })
}

// Start Tags Stack, ensures the correct tag is popped
// Normalization has already been done, so this should not go wrong; it is only a safeguard
var stackExpectedStartItems: [HTMLParsedResult.StartItem] = []
// `from` is the normalized [HTMLParsedResult] array produced by the previous step
for thisItem in from {
    switch thisItem {
    case .start(let item):
        let visitor = HTMLTagNameToMarkupVisitor(attributes: item.attributes)
        let htmlTag = self.htmlTags[item.tagName] ?? HTMLTag(tagName: ExtendTagName(item.tagName))
        // Use the Visitor to ask for the corresponding Markup
        let markup = visitor.visit(tagName: htmlTag.tagName)

        // Append the new node as a child of the current branch node,
        // then make it the current branch node
        htmlElementComponents.append(.init(markup: markup, value: .init(tag: htmlTag, tagAttributedString: item.tagAttributedString, attributes: item.attributes)))
        currentMarkup.appendChild(markup: markup)
        currentMarkup = markup

        stackExpectedStartItems.append(item)
    case .selfClosing(let item):
        // Directly append as a child of the current branch node
        let visitor = HTMLTagNameToMarkupVisitor(attributes: item.attributes)
        let htmlTag = self.htmlTags[item.tagName] ?? HTMLTag(tagName: ExtendTagName(item.tagName))
        let markup = visitor.visit(tagName: htmlTag.tagName)
        htmlElementComponents.append(.init(markup: markup, value: .init(tag: htmlTag, tagAttributedString: item.tagAttributedString, attributes: item.attributes)))
        currentMarkup.appendChild(markup: markup)
    case .close(let item):
        if let lastTagName = stackExpectedStartItems.popLast()?.tagName,
           lastTagName == item.tagName {
            // When encountering a Close Tag, return to the previous level
            currentMarkup = currentMarkup.parentMarkup ?? currentMarkup
        }
    case .rawString(let attributedString):
        // Directly append as a leaf of the current branch node
        currentMarkup.appendChild(markup: RawStringMarkup(attributedString: attributedString))
    }
}

// print(htmlElementComponents)
// [(markup: LinkMarkup, (tag: a, attributes: ["href":"zhgchg.li"]...)]
Operation result as shown in the figure
Corresponding source code implementation in HTMLParsedResultToHTMLElementWithRootMarkupProcessor.swift
At this point, we have actually completed the functionality of the Selector 🎉
public class HTMLSelector: CustomStringConvertible {
    let markup: Markup
    let componets: [HTMLElementMarkupComponent]

    init(markup: Markup, componets: [HTMLElementMarkupComponent]) {
        self.markup = markup
        self.componets = componets
    }

    public func filter(_ htmlTagName: String) -> [HTMLSelector] {
        let result = markup.childMarkups.filter({ componets.value(markup: $0)?.tag.tagName.isEqualTo(htmlTagName) ?? false })
        return result.map({ .init(markup: $0, componets: componets) })
    }

    //...
}
We can filter leaf node objects layer by layer.
Corresponding source code implementation in HTMLSelector
Parser — HTML to MarkupStyle (Abstract of NSAttributedString.Key)
Next, we need to complete the conversion of HTML to MarkupStyle (NSAttributedString.Key).
NSAttributedString sets the text style through NSAttributedString.Key Attributes. We abstract all fields of NSAttributedString.Key to correspond to MarkupStyle, MarkupStyleColor, MarkupStyleFont, MarkupStyleParagraphStyle.
Purpose:
- The original data structure of Attributes is [NSAttributedString.Key: Any?]. If exposed directly, it is difficult to control the values users input, and incorrect values may cause crashes, such as .font: 123.
- Styles need to be inheritable, such as <a><b>test</b></a>, where the test string combines its own bold style with the style inherited from the link (bold+link); if the Dictionary is exposed directly, it is difficult to control the inheritance rules.
- Encapsulate iOS/macOS (UIKit/AppKit) related objects.
MarkupStyle Struct
public struct MarkupStyle {
    public var font: MarkupStyleFont
    public var paragraphStyle: MarkupStyleParagraphStyle
    public var foregroundColor: MarkupStyleColor? = nil
    public var backgroundColor: MarkupStyleColor? = nil
    public var ligature: NSNumber? = nil
    public var kern: NSNumber? = nil
    public var tracking: NSNumber? = nil
    public var strikethroughStyle: NSUnderlineStyle? = nil
    public var underlineStyle: NSUnderlineStyle? = nil
    public var strokeColor: MarkupStyleColor? = nil
    public var strokeWidth: NSNumber? = nil
    public var shadow: NSShadow? = nil
    public var textEffect: String? = nil
    public var attachment: NSTextAttachment? = nil
    public var link: URL? = nil
    public var baselineOffset: NSNumber? = nil
    public var underlineColor: MarkupStyleColor? = nil
    public var strikethroughColor: MarkupStyleColor? = nil
    public var obliqueness: NSNumber? = nil
    public var expansion: NSNumber? = nil
    public var writingDirection: NSNumber? = nil
    public var verticalGlyphForm: NSNumber? = nil
    //...

    // Inherited from...
    // Default: when a field is nil, fill it in from the 'from' object
    mutating func fillIfNil(from: MarkupStyle?) {
        guard let from = from else { return }

        var currentFont = self.font
        currentFont.fillIfNil(from: from.font)
        self.font = currentFont

        var currentParagraphStyle = self.paragraphStyle
        currentParagraphStyle.fillIfNil(from: from.paragraphStyle)
        self.paragraphStyle = currentParagraphStyle
        //..
    }

    // MarkupStyle to NSAttributedString.Key: Any
    func render() -> [NSAttributedString.Key: Any] {
        var data: [NSAttributedString.Key: Any] = [:]

        if let font = font.getFont() {
            data[.font] = font
        }

        if let ligature = self.ligature {
            data[.ligature] = ligature
        }
        //...
        return data
    }
}

public struct MarkupStyleFont: MarkupStyleItem {
    public enum FontWeight {
        case style(FontWeightStyle)
        case rawValue(CGFloat)
    }
    public enum FontWeightStyle: String {
        case ultraLight, light, thin, regular, medium, semibold, bold, heavy, black
        // ...
    }

    public var size: CGFloat?
    public var weight: FontWeight?
    public var italic: Bool?
    //...
}

public struct MarkupStyleParagraphStyle: MarkupStyleItem {
    public var lineSpacing: CGFloat? = nil
    public var paragraphSpacing: CGFloat? = nil
    public var alignment: NSTextAlignment? = nil
    public var headIndent: CGFloat? = nil
    public var tailIndent: CGFloat? = nil
    public var firstLineHeadIndent: CGFloat? = nil
    public var minimumLineHeight: CGFloat? = nil
    public var maximumLineHeight: CGFloat? = nil
    public var lineBreakMode: NSLineBreakMode? = nil
    public var baseWritingDirection: NSWritingDirection? = nil
    public var lineHeightMultiple: CGFloat? = nil
    public var paragraphSpacingBefore: CGFloat? = nil
    public var hyphenationFactor: Float? = nil
    public var usesDefaultHyphenation: Bool? = nil
    public var tabStops: [NSTextTab]? = nil
    public var defaultTabInterval: CGFloat? = nil
    public var textLists: [NSTextList]? = nil
    public var allowsDefaultTighteningForTruncation: Bool? = nil
    public var lineBreakStrategy: NSParagraphStyle.LineBreakStrategy? = nil
    //...
}

public struct MarkupStyleColor {
    let red: Int
    let green: Int
    let blue: Int
    let alpha: CGFloat
    //...
}
Corresponding implementation in the source code MarkupStyle
Additionally, following the list of browser-predefined color names on the W3C wiki, an enum maps each color name to its R, G, B values: MarkupStyleColorName.swift
HTMLTagStyleAttribute & HTMLTagStyleAttributeVisitor
HTML tags can also be styled with inline CSS through the style attribute, so we apply the same abstraction used for HTMLTagName once more to HTML style attributes.
For example, the HTML might contain <a style="color:red;font-size:14px">RedLink</a>, which means this link should be red with a font size of 14px.
public protocol HTMLTagStyleAttribute {
var styleName: String { get }
func accept<V: HTMLTagStyleAttributeVisitor>(_ visitor: V) -> V.Result
}
public protocol HTMLTagStyleAttributeVisitor {
associatedtype Result
func visit(styleAttribute: HTMLTagStyleAttribute) -> Result
func visit(_ styleAttribute: ColorHTMLTagStyleAttribute) -> Result
func visit(_ styleAttribute: FontSizeHTMLTagStyleAttribute) -> Result
//...
}
public extension HTMLTagStyleAttributeVisitor {
func visit(styleAttribute: HTMLTagStyleAttribute) -> Result {
return styleAttribute.accept(self)
}
}
Corresponding implementation in the source code HTMLTagStyleAttribute
HTMLTagStyleAttributeToMarkupStyleVisitor
struct HTMLTagStyleAttributeToMarkupStyleVisitor: HTMLTagStyleAttributeVisitor {
typealias Result = MarkupStyle?
let value: String
func visit(_ styleAttribute: ColorHTMLTagStyleAttribute) -> Result {
// Regex to extract Color Hex or Mapping from HTML Pre-defined Color Name, please refer to the Source Code
guard let color = MarkupStyleColor(string: value) else { return nil }
return MarkupStyle(foregroundColor: color)
}
func visit(_ styleAttribute: FontSizeHTMLTagStyleAttribute) -> Result {
// Regex to extract 10px -> 10, please refer to the Source Code
guard let size = self.convert(fromPX: value) else { return nil }
return MarkupStyle(font: MarkupStyleFont(size: CGFloat(size)))
}
// ...
}
Corresponding implementation in the source code HTMLTagAttributeToMarkupStyleVisitor.swift
The value passed to init is the attribute's value; each visit method converts it into the corresponding MarkupStyle field for that attribute type.
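The conversion from a raw value such as 14px to a number can be sketched as follows. This is only a minimal illustration: the helper name pixelValue(from:) is hypothetical and is not the library's convert(fromPX:) implementation; it assumes the value is a plain number followed by px.

```swift
import Foundation

// Hypothetical helper (not part of ZMarkupParser): extracts the numeric part
// of a CSS pixel value such as "14px". Returns nil for anything else.
func pixelValue(from string: String) -> Double? {
    // Optional whitespace, a number with an optional fraction, then "px".
    let pattern = "^\\s*([0-9]+(?:\\.[0-9]+)?)\\s*px\\s*$"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return nil
    }
    let range = NSRange(string.startIndex..<string.endIndex, in: string)
    guard let match = regex.firstMatch(in: string, options: [], range: range),
          let numberRange = Range(match.range(at: 1), in: string) else {
        return nil
    }
    return Double(string[numberRange])
}

let size = pixelValue(from: "14px")      // Optional(14.0)
let invalid = pixelValue(from: "red")    // nil
```

Returning nil for unsupported values mirrors the visitor above, where a failed conversion simply yields no MarkupStyle.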
HTMLElementMarkupComponentMarkupStyleVisitor
After introducing the MarkupStyle object, we need to convert the result of Normalization’s HTMLElementComponents into MarkupStyle.
// MarkupStyle policy
public enum MarkupStylePolicy {
case respectMarkupStyleFromCode // Prioritize from Code, fill in with HTML Style Attribute
case respectMarkupStyleFromHTMLStyleAttribute // Prioritize from HTML Style Attribute, fill in with Code
}
struct HTMLElementMarkupComponentMarkupStyleVisitor: MarkupVisitor {
typealias Result = MarkupStyle?
let policy: MarkupStylePolicy
let components: [HTMLElementMarkupComponent]
let styleAttributes: [HTMLTagStyleAttribute]
func visit(_ markup: BoldMarkup) -> Result {
// .bold is just a default style defined in MarkupStyle, please refer to the Source Code
return defaultVisit(components.value(markup: markup), defaultStyle: .bold)
}
func visit(_ markup: LinkMarkup) -> Result {
// .link is just a default style defined in MarkupStyle, please refer to the Source Code
var markupStyle = defaultVisit(components.value(markup: markup), defaultStyle: .link) ?? .link
// Get the HtmlElement corresponding to LinkMarkup from HtmlElementComponents
// Find the href parameter from the attributes of HtmlElement (HTML carries URL String)
if let href = components.value(markup: markup)?.attributes?["href"] as? String,
let url = URL(string: href) {
markupStyle.link = url
}
return markupStyle
}
// ...
}
extension HTMLElementMarkupComponentMarkupStyleVisitor {
// Get the custom MarkupStyle specified in the HTMLTag container
private func customStyle(_ htmlElement: HTMLElementMarkupComponent.HTMLElement?) -> MarkupStyle? {
guard let customStyle = htmlElement?.tag.customStyle else {
return nil
}
return customStyle
}
// Default action
func defaultVisit(_ htmlElement: HTMLElementMarkupComponent.HTMLElement?, defaultStyle: MarkupStyle? = nil) -> Result {
var markupStyle: MarkupStyle? = customStyle(htmlElement) ?? defaultStyle
// Get the HtmlElement corresponding to LinkMarkup from HtmlElementComponents
// Check if the attributes of HtmlElement have a `Style` Attribute
guard let styleString = htmlElement?.attributes?["style"],
styleAttributes.count > 0 else {
// No
return markupStyle
}
// Has Style Attributes
// Split the Style Value string into an array
// font-size:14px;color:red -> [["font-size", "14px"], ["color", "red"]]
let styles = styleString.split(separator: ";").filter { $0.trimmingCharacters(in: .whitespacesAndNewlines) != "" }.map { $0.split(separator: ":") }
for style in styles {
guard style.count == 2 else {
continue
}
// e.g font-size
let key = style[0].trimmingCharacters(in: .whitespacesAndNewlines)
// e.g. 14px
let value = style[1].trimmingCharacters(in: .whitespacesAndNewlines)
if let styleAttribute = styleAttributes.first(where: { $0.isEqualTo(styleName: key) }) {
// Use the HTMLTagStyleAttributeToMarkupStyleVisitor mentioned above to convert back to MarkupStyle
let visitor = HTMLTagStyleAttributeToMarkupStyleVisitor(value: value)
if var thisMarkupStyle = visitor.visit(styleAttribute: styleAttribute) {
// When Style Attribute has a return value..
// Merge the result of the previous MarkupStyle
thisMarkupStyle.fillIfNil(from: markupStyle)
markupStyle = thisMarkupStyle
}
}
}
// If there is a default Style
if var defaultStyle = defaultStyle {
switch policy {
case .respectMarkupStyleFromHTMLStyleAttribute:
// Prioritize Style Attribute MarkupStyle, then
// Merge the result of defaultStyle
markupStyle?.fillIfNil(from: defaultStyle)
case .respectMarkupStyleFromCode:
// Prioritize defaultStyle, then
// Merge the result of Style Attribute MarkupStyle
defaultStyle.fillIfNil(from: markupStyle)
markupStyle = defaultStyle
}
}
return markupStyle
}
}
Corresponding implementation in the source code HTMLTagAttributeToMarkupStyleVisitor.swift
We define some default styles in MarkupStyle. A Markup falls back to its default style when no custom style is specified by the calling code.
There are two style inheritance strategies:
- respectMarkupStyleFromCode: Use the default style as the primary; then fill in whatever is still missing from the Style Attributes. Fields that already have a value are left unchanged.
- respectMarkupStyleFromHTMLStyleAttribute: Use the Style Attributes as the primary; then fill in whatever is still missing from the default style. Fields that already have a value are left unchanged.
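To make the difference between the two strategies concrete, here is a minimal sketch. SimpleStyle, SimplePolicy, and merge are hypothetical, simplified stand-ins (two optional fields instead of the full MarkupStyle), not the library's types.

```swift
// Simplified stand-in for MarkupStyle with only two optional fields (an assumption for brevity).
struct SimpleStyle: Equatable {
    var fontSize: Double?
    var colorName: String?

    // Copy a field from `from` only when this style has no value for it.
    mutating func fillIfNil(from: SimpleStyle?) {
        guard let from = from else { return }
        fontSize = fontSize ?? from.fontSize
        colorName = colorName ?? from.colorName
    }
}

enum SimplePolicy {
    case respectCode           // the default style from code wins
    case respectHTMLAttribute  // the HTML style attribute wins
}

func merge(defaultStyle: SimpleStyle, attributeStyle: SimpleStyle, policy: SimplePolicy) -> SimpleStyle {
    switch policy {
    case .respectCode:
        var result = defaultStyle
        result.fillIfNil(from: attributeStyle)
        return result
    case .respectHTMLAttribute:
        var result = attributeStyle
        result.fillIfNil(from: defaultStyle)
        return result
    }
}

let codeStyle = SimpleStyle(fontSize: 13, colorName: "blue")   // default defined in code
let htmlStyle = SimpleStyle(fontSize: nil, colorName: "red")   // from style="color:red"

let codeFirst = merge(defaultStyle: codeStyle, attributeStyle: htmlStyle, policy: .respectCode)
let htmlFirst = merge(defaultStyle: codeStyle, attributeStyle: htmlStyle, policy: .respectHTMLAttribute)
// codeFirst keeps "blue"; htmlFirst takes "red" and still inherits fontSize 13.
```

In both cases the missing font size is filled in; only the conflicting color field depends on the policy.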
HTMLElementWithMarkupToMarkupStyleProcessor
Convert the Normalization result into AST & MarkupStyleComponent.
Declare a new MarkupComponent to store the corresponding MarkupStyle:
struct MarkupStyleComponent: MarkupComponent {
typealias T = MarkupStyle
let markup: Markup
let value: MarkupStyle
init(markup: Markup, value: MarkupStyle) {
self.markup = markup
self.value = value
}
}
Simple traversal of the Markup Tree & HTMLElementMarkupComponent structure:
let styleAttributes: [HTMLTagStyleAttribute]
let policy: MarkupStylePolicy
func process(from: (Markup, [HTMLElementMarkupComponent])) -> [MarkupStyleComponent] {
var components: [MarkupStyleComponent] = []
let visitor = HTMLElementMarkupComponentMarkupStyleVisitor(policy: policy, components: from.1, styleAttributes: styleAttributes)
walk(markup: from.0, visitor: visitor, components: &components)
return components
}
func walk(markup: Markup, visitor: HTMLElementMarkupComponentMarkupStyleVisitor, components: inout [MarkupStyleComponent]) {
if let markupStyle = visitor.visit(markup: markup) {
components.append(.init(markup: markup, value: markupStyle))
}
for markup in markup.childMarkups {
walk(markup: markup, visitor: visitor, components: &components)
}
}
// print(components)
// [(markup: LinkMarkup, value: MarkupStyle(link: https://zhgchg.li, color: .blue))]
// [(markup: BoldMarkup, value: MarkupStyle(font: .init(weight: .bold)))]
Corresponding implementation in the source code HTMLElementWithMarkupToMarkupStyleProcessor.swift
The processing result is shown in the image above.
Render — Convert To NSAttributedString
Now that we have the abstract syntax tree of HTML tags and the MarkupStyle corresponding to each tag, the final step is to produce the NSAttributedString rendering result.
MarkupNSAttributedStringVisitor
Visit each Markup and convert it to NSAttributedString:
struct MarkupNSAttributedStringVisitor: MarkupVisitor {
typealias Result = NSAttributedString
let components: [MarkupStyleComponent]
// root / base MarkupStyle, specified externally, for example, the size of the entire string
let rootStyle: MarkupStyle?
func visit(_ markup: RootMarkup) -> Result {
// Look down to the RawString object
return collectAttributedString(markup)
}
func visit(_ markup: RawStringMarkup) -> Result {
// Return Raw String
// Collect all MarkupStyles in the chain
// Apply Style to NSAttributedString
return applyMarkupStyle(markup.attributedString, with: collectMarkupStyle(markup))
}
func visit(_ markup: BoldMarkup) -> Result {
// Look down to the RawString object
return collectAttributedString(markup)
}
func visit(_ markup: LinkMarkup) -> Result {
// Look down to the RawString object
return collectAttributedString(markup)
}
// ...
}
private extension MarkupNSAttributedStringVisitor {
// Apply Style to NSAttributedString
func applyMarkupStyle(_ attributedString: NSAttributedString, with markupStyle: MarkupStyle?) -> NSAttributedString {
guard let markupStyle = markupStyle else { return attributedString }
let mutableAttributedString = NSMutableAttributedString(attributedString: attributedString)
mutableAttributedString.addAttributes(markupStyle.render(), range: NSMakeRange(0, mutableAttributedString.string.utf16.count))
return mutableAttributedString
}
func collectAttributedString(_ markup: Markup) -> NSMutableAttributedString {
// collect from downstream
// Root -> Bold -> String("Bold")
// \
// > String("Test")
// Result: Bold Test
// Recursively visit and combine the final NSAttributedString by looking down layer by layer for raw strings
return markup.childMarkups.compactMap({ visit(markup: $0) }).reduce(NSMutableAttributedString()) { partialResult, attributedString in
partialResult.append(attributedString)
return partialResult
}
}
func collectMarkupStyle(_ markup: Markup) -> MarkupStyle? {
// collect from upstream
// String("Test") -> Bold -> Italic -> Root
// Result: style: Bold+Italic
// Inherit styles layer by layer by looking up for parent tag's markupstyle
var currentMarkup: Markup? = markup.parentMarkup
var currentStyle = components.value(markup: markup)
while let thisMarkup = currentMarkup {
guard let thisMarkupStyle = components.value(markup: thisMarkup) else {
currentMarkup = thisMarkup.parentMarkup
continue
}
if var thisCurrentStyle = currentStyle {
thisCurrentStyle.fillIfNil(from: thisMarkupStyle)
currentStyle = thisCurrentStyle
} else {
currentStyle = thisMarkupStyle
}
currentMarkup = thisMarkup.parentMarkup
}
if var currentStyle = currentStyle {
currentStyle.fillIfNil(from: rootStyle)
return currentStyle
} else {
return rootStyle
}
}
}
Corresponding implementation in the source code MarkupNSAttributedStringVisitor.swift
The operation process and result are shown in the figure.
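The upstream style collection can be sketched on its own with a simplified tree. StyledNode and inheritedStyle are hypothetical names, and styles are reduced to a plain dictionary; the point is only that values set closer to the text win, and ancestors fill in what is still missing.

```swift
// Hypothetical node with a weak parent link, standing in for Markup.
final class StyledNode {
    let name: String
    let style: [String: String]   // simplified style: attribute name -> value
    weak var parent: StyledNode?

    init(name: String, style: [String: String] = [:], parent: StyledNode? = nil) {
        self.name = name
        self.style = style
        self.parent = parent
    }
}

// Walk from the node up to the root; existing values are never overwritten.
func inheritedStyle(of node: StyledNode, rootStyle: [String: String]) -> [String: String] {
    var result = node.style
    var current = node.parent
    while let ancestor = current {
        result.merge(ancestor.style) { own, _ in own }
        current = ancestor.parent
    }
    result.merge(rootStyle) { own, _ in own }
    return result
}

// root -> <i> -> <b> -> "text"
let rootNode = StyledNode(name: "root")
let italicNode = StyledNode(name: "i", style: ["italic": "true"], parent: rootNode)
let boldNode = StyledNode(name: "b", style: ["weight": "bold", "italic": "false"], parent: italicNode)
let textNode = StyledNode(name: "text", parent: boldNode)

let collected = inheritedStyle(of: textNode, rootStyle: ["size": "13"])
// The nearer <b> wins the "italic" conflict; "size" comes from the root style.
```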
Finally, we can get:
Li{
NSColor = "Blue";
NSFont = "<UICTFont: 0x145d17600> font-family: \".SFUI-Regular\"; font-weight: normal; font-style: normal; font-size: 13.00pt";
NSLink = "https://zhgchg.li";
}nk{
NSColor = "Blue";
NSFont = "<UICTFont: 0x145d18710> font-family: \".SFUI-Semibold\"; font-weight: bold; font-style: normal; font-size: 13.00pt";
NSLink = "https://zhgchg.li";
}Bold{
NSFont = "<UICTFont: 0x145d18710> font-family: \".SFUI-Semibold\"; font-weight: bold; font-style: normal; font-size: 13.00pt";
}
🎉🎉🎉🎉Completed🎉🎉🎉🎉
At this point, we have completed the entire conversion process from HTML String to NSAttributedString.
Stripper — Stripping HTML Tags
Stripping HTML tags is relatively simple; we only need to:
func attributedString(_ markup: Markup) -> NSAttributedString {
if let rawStringMarkup = markup as? RawStringMarkup {
return rawStringMarkup.attributedString
} else {
return markup.childMarkups.compactMap({ attributedString($0) }).reduce(NSMutableAttributedString()) { partialResult, attributedString in
partialResult.append(attributedString)
return partialResult
}
}
}
Corresponding implementation in the source code MarkupStripperProcessor.swift
This is similar to Render, except that it returns the content of each RawStringMarkup as-is, without applying any style.
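The same downstream collection can be shown with plain strings. Node and strip are hypothetical, simplified stand-ins for Markup / RawStringMarkup and the stripper, assuming only leaf nodes carry text.

```swift
// Hypothetical minimal node type standing in for Markup / RawStringMarkup.
final class Node {
    let text: String?        // non-nil only for raw-string leaves
    let children: [Node]

    init(text: String? = nil, children: [Node] = []) {
        self.text = text
        self.children = children
    }
}

// Depth-first traversal: leaves return their text, containers concatenate their children.
func strip(_ node: Node) -> String {
    if let text = node.text {
        return text
    }
    return node.children.map(strip).reduce("", +)
}

// Tree for <b>Test<a>Link</a></b>
let tree = Node(children: [
    Node(children: [                            // <b>
        Node(text: "Test"),
        Node(children: [Node(text: "Link")])    // <a>
    ])
])

let plainText = strip(tree)   // "TestLink"
```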
Extend — Dynamic Extension
It is not feasible to cover every HTML tag and style attribute in advance, so a dynamic extension point is provided, making it convenient to add new tags and style attributes directly from code.
public struct ExtendTagName: HTMLTagName {
public let string: String
public init(_ w3cHTMLTagName: WC3HTMLTagName) {
self.string = w3cHTMLTagName.rawValue
}
public init(_ string: String) {
self.string = string.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
}
public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
return visitor.visit(self)
}
}
// to
final class ExtendMarkup: Markup {
weak var parentMarkup: Markup? = nil
var childMarkups: [Markup] = []
func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
return visitor.visit(self)
}
}
//----
public struct ExtendHTMLTagStyleAttribute: HTMLTagStyleAttribute {
public let styleName: String
public let render: ((String) -> (MarkupStyle?)) // Dynamically change MarkupStyle using closure
public init(styleName: String, render: @escaping ((String) -> (MarkupStyle?))) {
self.styleName = styleName
self.render = render
}
public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagStyleAttributeVisitor {
return visitor.visit(self)
}
}
ZHTMLParserBuilder
Finally, we use the Builder Pattern to allow external Modules to quickly construct the objects required by ZMarkupParser and ensure Access Level Control.
public final class ZHTMLParserBuilder {
private(set) var htmlTags: [HTMLTag] = []
private(set) var styleAttributes: [HTMLTagStyleAttribute] = []
private(set) var rootStyle: MarkupStyle?
private(set) var policy: MarkupStylePolicy = .respectMarkupStyleFromCode
public init() {
}
public static func initWithDefault() -> Self {
var builder = Self.init()
for htmlTagName in ZHTMLParserBuilder.htmlTagNames {
builder = builder.add(htmlTagName)
}
for styleAttribute in ZHTMLParserBuilder.styleAttributes {
builder = builder.add(styleAttribute)
}
return builder
}
public func set(_ htmlTagName: HTMLTagName, withCustomStyle markupStyle: MarkupStyle?) -> Self {
return self.add(htmlTagName, withCustomStyle: markupStyle)
}
public func add(_ htmlTagName: HTMLTagName, withCustomStyle markupStyle: MarkupStyle? = nil) -> Self {
// Only one tagName can exist
htmlTags.removeAll { htmlTag in
return htmlTag.tagName.string == htmlTagName.string
}
htmlTags.append(HTMLTag(tagName: htmlTagName, customStyle: markupStyle))
return self
}
public func add(_ styleAttribute: HTMLTagStyleAttribute) -> Self {
styleAttributes.removeAll { thisStyleAttribute in
return thisStyleAttribute.styleName == styleAttribute.styleName
}
styleAttributes.append(styleAttribute)
return self
}
public func set(rootStyle: MarkupStyle) -> Self {
self.rootStyle = rootStyle
return self
}
public func set(policy: MarkupStylePolicy) -> Self {
self.policy = policy
return self
}
public func build() -> ZHTMLParser {
// ZHTMLParser init is only open for internal use, external cannot directly init
// Can only be initialized through ZHTMLParserBuilder
return ZHTMLParser(htmlTags: htmlTags, styleAttributes: styleAttributes, policy: policy, rootStyle: rootStyle)
}
}
Corresponding implementation in ZHTMLParserBuilder.swift
initWithDefault will add all implemented HTMLTagName/Style Attribute by default
public extension ZHTMLParserBuilder {
static var htmlTagNames: [HTMLTagName] {
return [
A_HTMLTagName(),
B_HTMLTagName(),
BR_HTMLTagName(),
DIV_HTMLTagName(),
HR_HTMLTagName(),
I_HTMLTagName(),
LI_HTMLTagName(),
OL_HTMLTagName(),
P_HTMLTagName(),
SPAN_HTMLTagName(),
STRONG_HTMLTagName(),
U_HTMLTagName(),
UL_HTMLTagName(),
DEL_HTMLTagName(),
TR_HTMLTagName(),
TD_HTMLTagName(),
TH_HTMLTagName(),
TABLE_HTMLTagName(),
IMG_HTMLTagName(handler: nil),
// ...
]
}
}
public extension ZHTMLParserBuilder {
static var styleAttributes: [HTMLTagStyleAttribute] {
return [
ColorHTMLTagStyleAttribute(),
BackgroundColorHTMLTagStyleAttribute(),
FontSizeHTMLTagStyleAttribute(),
FontWeightHTMLTagStyleAttribute(),
LineHeightHTMLTagStyleAttribute(),
WordSpacingHTMLTagStyleAttribute(),
// ...
]
}
}
ZHTMLParser's init is internal, so external modules cannot call it directly; a parser can only be created through ZHTMLParserBuilder.
ZHTMLParser encapsulates the Render/Selector/Stripper operations:
public final class ZHTMLParser: ZMarkupParser {
let htmlTags: [HTMLTag]
let styleAttributes: [HTMLTagStyleAttribute]
let rootStyle: MarkupStyle?
internal init(...) {
}
// Get link style attributes
public var linkTextAttributes: [NSAttributedString.Key: Any] {
// ...
}
public func selector(_ string: String) -> HTMLSelector {
// ...
}
public func selector(_ attributedString: NSAttributedString) -> HTMLSelector {
// ...
}
public func render(_ string: String) -> NSAttributedString {
// ...
}
// Allow rendering of NSAttributedString within nodes using HTMLSelector results
public func render(_ selector: HTMLSelector) -> NSAttributedString {
// ...
}
public func render(_ attributedString: NSAttributedString) -> NSAttributedString {
// ...
}
public func stripper(_ string: String) -> String {
// ...
}
public func stripper(_ attributedString: NSAttributedString) -> NSAttributedString {
// ...
}
// ...
}
Corresponding implementation in the source code ZHTMLParser.swift
UIKit Issues
The result of NSAttributedString is most commonly displayed in a UITextView, but note:
- The link style in UITextView is uniformly determined by the linkTextAttributes setting, not by the NSAttributedString.Key settings, and individual links cannot be styled separately; this is why ZMarkupParser.linkTextAttributes is exposed.
- UILabel currently has no way to change the link style, and since UILabel does not have TextStorage, loading NSTextAttachment images requires handling UILabel separately.
public extension UITextView {
func setHtmlString(_ string: String, with parser: ZHTMLParser) {
self.setHtmlString(NSAttributedString(string: string), with: parser)
}
func setHtmlString(_ string: NSAttributedString, with parser: ZHTMLParser) {
self.attributedText = parser.render(string)
self.linkTextAttributes = parser.linkTextAttributes
}
}
public extension UILabel {
func setHtmlString(_ string: String, with parser: ZHTMLParser) {
self.setHtmlString(NSAttributedString(string: string), with: parser)
}
func setHtmlString(_ string: NSAttributedString, with parser: ZHTMLParser) {
let attributedString = parser.render(string)
attributedString.enumerateAttribute(NSAttributedString.Key.attachment, in: NSMakeRange(0, attributedString.string.utf16.count), options: []) { (value, effectiveRange, stop) in
guard let attachment = value as? ZNSTextAttachment else {
return
}
attachment.register(self)
}
self.attributedText = attributedString
}
}
Therefore, by extending UIKit, external users only need to call setHtmlString() to complete the binding.
Complex Rendering Items— List Items
A record of how list items were implemented.
In HTML, <ol> / <ul> wrap <li> to represent list items:
<ul>
<li>ItemA</li>
<li>ItemB</li>
<li>ItemC</li>
<!-- ... -->
</ul>
Using the same parsing method as before, inside visit(_ markup: ListItemMarkup) we can look up the sibling list items to determine the index of the current item (thanks to the conversion to an AST).
func visit(_ markup: ListItemMarkup) -> Result {
let siblingListItems = markup.parentMarkup?.childMarkups.filter({ $0 is ListItemMarkup }) ?? []
let position = (siblingListItems.firstIndex(where: { $0 === markup }) ?? 0)
}
NSParagraphStyle has an NSTextList object that can be used to display list items, but in practice the width of the whitespace after the marker cannot be customized (personally, I think it is too large). If there is whitespace between the bullet and the string, a line break can be triggered there, making the display look a bit odd, as shown in the image below:
The "Better" result can potentially be achieved by setting headIndent, firstLineHeadIndent, and NSTextTab, but testing shows that if the string is too long or the size changes, the result still cannot be presented perfectly.
Currently, we settle for the "Acceptable" result: generating the list marker string ourselves and inserting it before the item text.
We only use NSTextList.MarkerFormat to generate the list item symbols, rather than using NSTextList directly.
For a list of supported list symbols, refer to: MarkupStyleList.swift
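The "insert the marker before the string" approach can be sketched without UIKit. ListContainer, ItemNode, and renderedLine are hypothetical names, and the markers are hard-coded here rather than generated through NSTextList.MarkerFormat as the library does.

```swift
// Hypothetical stand-ins for the <ol>/<ul> markup and the <li> markup.
final class ItemNode {
    let text: String
    weak var parent: ListContainer?
    init(text: String) { self.text = text }
}

final class ListContainer {
    let ordered: Bool                      // true for <ol>, false for <ul>
    private(set) var items: [ItemNode] = []
    init(ordered: Bool) { self.ordered = ordered }

    func append(_ item: ItemNode) {
        item.parent = self
        items.append(item)
    }
}

// Find the item's position among its siblings by identity, then prepend the marker.
func renderedLine(for item: ItemNode) -> String {
    let siblings = item.parent?.items ?? []
    let position = siblings.firstIndex(where: { $0 === item }) ?? 0
    let marker = (item.parent?.ordered ?? false) ? "\(position + 1)." : "•"
    return "\(marker) \(item.text)"
}

let orderedList = ListContainer(ordered: true)
let itemA = ItemNode(text: "ItemA")
let itemB = ItemNode(text: "ItemB")
orderedList.append(itemA)
orderedList.append(itemB)

let secondLine = renderedLine(for: itemB)   // "2. ItemB"
```

Because the marker is part of the string itself, wrapping and font-size changes cannot separate it from the item text, which is the trade-off accepted above.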
Final display result: <ol><li>
Complex Rendering Items — Table
Similar to the implementation of list items, but for tables.
In HTML, <table> creates a table, which wraps <tr> table rows, which in turn wrap <td>/<th> table cells:
<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
Testing shows that the native NSAttributedString.DocumentType.html uses NSTextBlock, an API that is available on macOS but private on iOS, to complete the display, so it can fully display the HTML table style and content.
A bit of cheating! We can't use a private API 🥲
func visit(_ markup: TableColumnMarkup) -> Result {
let attributedString = collectAttributedString(markup)
let siblingColumns = markup.parentMarkup?.childMarkups.filter({ $0 is TableColumnMarkup }) ?? []
let position = (siblingColumns.firstIndex(where: { $0 === markup }) ?? 0)
// Whether to specify the desired width externally, can set .max to not truncate string
var maxLength: Int? = markup.fixedMaxLength
if maxLength == nil {
// If not specified, find the string length of the same column in the first row as the max length
if let tableRowMarkup = markup.parentMarkup as? TableRowMarkup,
let firstTableRow = tableRowMarkup.parentMarkup?.childMarkups.first(where: { $0 is TableRowMarkup }) as? TableRowMarkup {
let firstTableRowColumns = firstTableRow.childMarkups.filter({ $0 is TableColumnMarkup })
if firstTableRowColumns.indices.contains(position) {
let firstTableRowColumnAttributedString = collectAttributedString(firstTableRowColumns[position])
let length = firstTableRowColumnAttributedString.string.utf16.count
maxLength = length
}
}
}
if let maxLength = maxLength {
// If the field exceeds maxLength, truncate the string
if attributedString.string.utf16.count > maxLength {
attributedString.mutableString.setString(String(attributedString.string.prefix(maxLength))+"...")
} else {
attributedString.mutableString.setString(attributedString.string.padding(toLength: maxLength, withPad: " ", startingAt: 0))
}
}
if position < siblingColumns.count - 1 {
// Add spaces as spacing, the width of the spacing can be specified externally
attributedString.append(makeString(in: markup, string: String(repeating: " ", count: markup.spacing)))
}
return attributedString
}
func visit(_ markup: TableRowMarkup) -> Result {
let attributedString = collectAttributedString(markup)
attributedString.append(makeBreakLine(in: markup)) // Add line break, for details refer to Source Code
return attributedString
}
func visit(_ markup: TableMarkup) -> Result {
let attributedString = collectAttributedString(markup)
attributedString.append(makeBreakLine(in: markup)) // Add line break, for details refer to Source Code
attributedString.insert(makeBreakLine(in: markup), at: 0) // Add line break, for details refer to Source Code
return attributedString
}
The final presentation is as follows: not perfect, but acceptable.
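The padding and truncation idea can be tried out on plain strings. fitCell and renderRow are hypothetical helpers; unlike the visitor above, this sketch measures width in Characters rather than UTF-16 units and assumes a monospaced display.

```swift
// Hypothetical helper: fit a cell's text to a fixed width measured in Characters.
func fitCell(_ text: String, width: Int) -> String {
    if text.count > width {
        // Too long: keep the first `width` characters and mark the truncation.
        return String(text.prefix(width)) + "..."
    }
    // Short enough: pad with spaces so columns line up.
    return text + String(repeating: " ", count: width - text.count)
}

// Join the fitted cells of one row, separated by `spacing` spaces.
func renderRow(_ cells: [String], widths: [Int], spacing: Int) -> String {
    let gap = String(repeating: " ", count: spacing)
    return zip(cells, widths).map { fitCell($0, width: $1) }.joined(separator: gap)
}

let headerCells = ["Company", "Contact", "Country"]
// As in the article, the first row defines the width of each column.
let columnWidths = headerCells.map { $0.count }
let bodyCells = ["Alfreds Futterkiste", "Maria Anders", "Germany"]

let headerLine = renderRow(headerCells, widths: columnWidths, spacing: 2)
let bodyLine = renderRow(bodyCells, widths: columnWidths, spacing: 2)
// headerLine: "Company  Contact  Country"
// bodyLine:   "Alfreds...  Maria A...  Germany"
```

Note that a truncated cell ends up three characters wider than its column because of the appended "...", which is one reason the result is only acceptable rather than perfectly aligned.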
Complex Rendering Items — Image
Finally, let’s talk about the biggest challenge, loading remote images into NSAttributedString.
Use <img> to represent images in HTML:
<img src="https://user-images.githubusercontent.com/33706588/219608966-20e0c017-d05c-433a-9a52-091bc0cfd403.jpg" width="300" height="125"/>
You can specify the desired display size through the width / height HTML attributes.
Displaying images in NSAttributedString is much more complicated than I imagined, and there is no good existing implementation. I ran into some pitfalls earlier when implementing text wrapping in UITextView, and after researching again, I found that there is still no perfect solution.
For now, let's set aside the issue that NSTextAttachment natively cannot be reused or have its memory released. We will first implement downloading remote images, placing them into NSTextAttachment and then into NSAttributedString, and updating the content automatically.
This series of operations is split into another small project for better optimization and reuse in other projects in the future:
Mainly referring to Asynchronous NSTextAttachments series of articles for implementation, but replacing the final content update part (refreshing the UI after downloading) and adding Delegate/DataSource for external extension use.
The operation flow and relationships are shown in the figure above:
- Declare a ZNSTextAttachmentable object that wraps either the NSTextStorage object (built into UITextView) or the UILabel itself (UILabel has no NSTextStorage). Its only operation is replacing the attributed string at a given NSRange (func replace(attachment: ZNSTextAttachment, to: ZResizableNSTextAttachment)).
- The principle is to use ZNSTextAttachment to package the imageURL, the placeholder image, and the desired display size, then display the placeholder image first.
- When the system needs this image on screen, it calls the image(forBounds… method, at which point we start downloading the image data.
- A DataSource is exposed so external code can decide how to download the image or implement an image cache policy; by default, URLSession is used directly to request the image data.
- After downloading, create a new ZResizableNSTextAttachment and implement the custom image size logic in attachmentBounds(for…
- Call the replace(attachment: ZNSTextAttachment, to: ZResizableNSTextAttachment) method to replace the ZNSTextAttachment at its position with the ZResizableNSTextAttachment.
- Issue the didLoad delegate notification, so external code can hook in if needed.
- Complete
For detailed code, refer to Source Code.
NSLayoutManager.invalidateLayout(forCharacterRange: range, actualCharacterRange: nil) and NSLayoutManager.invalidateDisplay(forCharacterRange: range) are not used to refresh the UI because the UI did not display the update correctly with them; since the range is known, directly replacing that range of the NSAttributedString ensures the UI is updated correctly.
The final display result is as follows:
<span style="color:red">こんにちは</span>こんにちはこんにちは <br />
<img src="https://user-images.githubusercontent.com/33706588/219608966-20e0c017-d05c-433a-9a52-091bc0cfd403.jpg"/>
Testing & Continuous Integration
In this project, in addition to Unit Tests, Snapshot Tests were set up as integration tests, making it easy to test and compare the final NSAttributedString comprehensively.
The main functional logic is covered by unit tests and integration tests. The final test coverage is around 85%.
Snapshot Test
Directly use the framework:
import SnapshotTesting
// ...
func testShouldKeepNSAttributedString() {
let parser = ZHTMLParserBuilder.initWithDefault().build()
let textView = UITextView()
textView.frame.size.width = 390
textView.isScrollEnabled = false
textView.backgroundColor = .white
textView.setHtmlString("html string...", with: parser)
textView.layoutIfNeeded()
assertSnapshot(matching: textView, as: .image, record: false)
}
// ...
The snapshot compares the final rendered result directly against the expected image, ensuring that changes do not break the integrated output.
Codecov Test Coverage
Integrate Codecov.io (free for public repos) to evaluate test coverage. Just install the Codecov GitHub App and configure it.
After connecting Codecov to the GitHub repo, you can also add a codecov.yml configuration file to the root directory of the project:
comment: # this is a top-level key
  layout: "reach, diff, flags, files"
  behavior: default
  require_changes: false # if true: only post the comment if coverage changes
  require_base: no # [yes :: must have a base report to post]
  require_head: yes # [yes :: must have a head report to post]
With this configuration, the coverage results are automatically posted as a comment on each PR.
Continuous Integration
Github Action, CI integration: ci.yml
name: CI

on:
  workflow_dispatch:
  pull_request:
    types: [opened, reopened]
  push:
    branches:
      - main

jobs:
  build:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: spm build and test
        run: |
          set -o pipefail
          xcodebuild test -workspace ZMarkupParser.xcworkspace -testPlan ZMarkupParser -scheme ZMarkupParser -enableCodeCoverage YES -resultBundlePath './scripts/TestResult.xcresult' -destination 'platform=iOS Simulator,name=iPhone 14,OS=16.1' build test | xcpretty
      - name: Codecov
        uses: codecov/codecov-action@v3.1.1
        with:
          xcode: true
          xcode_archive_path: './scripts/TestResult.xcresult'
This configuration runs build and test when a PR is opened/reopened or when commits are pushed to the main branch, and finally uploads the test coverage report to Codecov.
Regex
Regarding regular expressions, each use improves it further; this time, not much was used, but because I originally wanted to use a regex to extract paired HTML Tags, I also studied how to write it.
Some new cheat sheet notes learned this time…
- ?: lets ( ) group a sub-pattern without capturing it. e.g. (?:https?:\/\/)?(?:www\.)?example\.com returns the entire URL in https://www.example.com without capturing https:// or www as separate groups
- .+? is a non-greedy match (it stops at the nearest possible end). e.g. <.+?> returns <a> and </a> in <a>test</a> instead of the entire string
- (?=XYZ) is a lookahead; combined with .+? it matches any string until the string XYZ appears. Note that the similar-looking [^XYZ] means any character other than X, Y, or Z, so repeating it matches until one of those characters appears. e.g. (?:__)(.+?(?=__))(?:__) (any string until __ ) captures test from __test__
- ?R recursively applies the same rule. e.g. \((?:[^()]|((?R)))+\) finds (simple), (and(nested)), and (nested) in (simple) (and(nested))
- ?<GroupName> … \k<GroupName> names a group and later matches the same text that the named group matched. e.g. (?<tagName><a>).*(\k<tagName>)
- (?(X)yes|no) matches the pattern yes if group X has a match result (a Group Name can be used as well), otherwise it matches no. Swift does not support this yet
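A few of these rules can be checked quickly from Swift. The following is only a minimal sketch using NSRegularExpression from Foundation; the matches(of:in:group:) helper and the sample strings are illustrative assumptions and are not part of ZMarkupParser.

```swift
import Foundation

/// Hypothetical helper for this sketch: returns, for every match of `pattern`
/// in `text`, the substring of capture group `group` (0 = the whole match).
func matches(of pattern: String, in text: String, group: Int = 0) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { result in
        // A group that did not take part in the match yields no Swift range.
        guard let range = Range(result.range(at: group), in: text) else { return nil }
        return String(text[range])
    }
}

// Non-greedy: each tag is matched on its own.
let lazyTags = matches(of: "<.+?>", in: "<a>test</a>")    // ["<a>", "</a>"]

// Greedy: a single match that spans the whole string.
let greedyTags = matches(of: "<.+>", in: "<a>test</a>")   // ["<a>test</a>"]

// Non-capturing groups plus lookahead: group 1 is the text between the markers.
let emphasized = matches(of: "(?:__)(.+?(?=__))(?:__)", in: "__test__", group: 1) // ["test"]

// Named group with back-reference: the closing tag must repeat the opening name,
// so the mismatched <a>y</b> is skipped.
let paired = matches(of: "<(?<tagName>[a-z]+)>.*?</\\k<tagName>>", in: "<b>x</b><a>y</b>") // ["<b>x</b>"]
```

The ?R and (?(X)yes|no) forms are left out of the sketch; as noted above, the conditional form is not supported in Swift.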
Other good articles on Regex:
- Swift Regex Quick Reference
- How do regular expressions work? -> Can refer to this when optimizing the regex performance of this project later
- Case of Regex error causing infinite search, eventually leading to server failure
- Regex101: the quick reference in the bottom right corner lets you look up all regex rules
Swift Package Manager & CocoaPods
This is also my first time developing with SPM & CocoaPods. It's quite interesting, and SPM is really convenient; however, if two projects depend on the same package, opening both at the same time leaves one of them unable to find the package, and its build fails.
ZMarkupParser has been uploaded to CocoaPods, but I haven't tested whether it works properly, because I'm using SPM 😝.
ChatGPT
In actual development, I found it most useful for helping edit the Readme; for the development work itself, I haven't felt any significant impact yet. When I asked mid-to-senior level questions, it couldn't provide a definite answer or even gave incorrect answers (I encountered some incorrect regex rules). So, in the end, I still turned to Google for the correct answers.
Asking it to write code is even less promising: unless it's a simple Code Gen Object, don't expect it to complete an entire tool architecture directly. (At least for now; Copilot might be more helpful for writing code.)
However, it can provide a general direction for knowledge blind spots, giving a quick rough idea of how certain things should be done. When I know too little about a topic, it's hard to pinpoint the correct direction on Google quickly, and that's when ChatGPT is quite helpful.
Disclaimer
After more than three months of research and development, I am exhausted, but I still need to declare that this approach is only a feasible result obtained after my research. It is not necessarily the best solution, and there may still be areas for optimization. This project is more like a starting point, hoping to get a perfect solution for Markup Language to NSAttributedString. Everyone is very welcome to contribute; many things still need the power of the community to be perfected.
Contributing
Here are some areas that I think can be improved as of now (2023/03/12), and will be recorded in the Repo later:
- Optimization of performance/algorithm: although it is faster and more stable than the native NSAttributedString.DocumentType.html, there is still much room for optimization. I believe the performance is definitely not as good as XMLParser; I hope one day it can reach the same performance while keeping customization and automatic error correction.
- Support for converting more HTML Tags and Style Attributes
- Further optimization of ZNSTextAttachment: implementing reuse and releasing memory; this may require researching CoreText
- Support for Markdown parsing. The underlying abstraction is not limited to HTML, so once the front-end step that turns Markdown into Markup objects is built, Markdown parsing can be completed. That is why I named it ZMarkupParser rather than ZHTMLParser, hoping that one day it can also support Markdown to NSAttributedString
- Support for Any to Any, e.g. HTML to Markdown or Markdown to HTML. Since we have the original AST tree (Markup objects), conversion between any Markup is possible
- Implement CSS !important functionality, enhancing the inheritance strategy of the abstract MarkupStyle
- Enhance HTML Selector functionality; currently it is only the most basic filter
- Many more; feel free to open an issue
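The "Any to Any" idea can be pictured with a toy tree. The sketch below is not ZMarkupParser's code: MarkupNode, renderHTML, and renderMarkdown are hypothetical names, and the tree only knows text, bold, and links. It merely illustrates that once a string has been parsed into a tree, each output format is just another traversal of that tree.

```swift
// Hypothetical, simplified markup tree; ZMarkupParser's real Markup objects differ.
enum MarkupNode {
    case text(String)
    case bold([MarkupNode])
    case link(href: String, children: [MarkupNode])
}

// Walks the tree and emits HTML. No escaping is done; illustration only.
func renderHTML(_ nodes: [MarkupNode]) -> String {
    return nodes.map { node -> String in
        switch node {
        case .text(let string):
            return string
        case .bold(let children):
            return "<b>" + renderHTML(children) + "</b>"
        case .link(let href, let children):
            return "<a href=\"" + href + "\">" + renderHTML(children) + "</a>"
        }
    }.joined()
}

// Walks the same tree and emits Markdown instead.
func renderMarkdown(_ nodes: [MarkupNode]) -> String {
    return nodes.map { node -> String in
        switch node {
        case .text(let string):
            return string
        case .bold(let children):
            return "**" + renderMarkdown(children) + "**"
        case .link(let href, let children):
            return "[" + renderMarkdown(children) + "](" + href + ")"
        }
    }.joined()
}

// One tree, two output formats. The URL is a placeholder.
let tree: [MarkupNode] = [
    .bold([.text("Test"), .link(href: "https://example.com", children: [.text("Link")])])
]
let html = renderHTML(tree)         // <b>Test<a href="https://example.com">Link</a></b>
let markdown = renderMarkdown(tree) // **Test[Link](https://example.com)**
```

A real implementation would also need escaping and a parser for each input format; the sketch covers only the output side.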
Summary
That covers all the technical details and the journey of developing ZMarkupParser. It took me almost three months of after-work and weekend time, with countless rounds of research and practice, writing tests, improving Test Coverage, and setting up CI, before there was finally a somewhat decent result. I hope this tool solves the same problems for others and that everyone can help make it even better.
It is currently used in our company's pinkoi.com iOS App, and no issues have been found so far. 😄
If you have any questions or feedback, feel free to contact me.
===
This article was first published in Traditional Chinese on Medium ➡️ View Here