The Craft of Building a Handmade HTML Parser

The development log of ZMarkupParser HTML to NSAttributedString rendering engine

Tokenization conversion of HTML String, Normalization processing, generation of Abstract Syntax Tree, application of Visitor Pattern / Builder Pattern, and some miscellaneous discussions…

Continuation

Last year, I published an article titled “[ TL;DR ] Implementing iOS NSAttributedString HTML Render”, which briefly introduced how to use XMLParser to parse HTML and then convert it into NSAttributedString.Key. The structure and thought process in the article were quite disorganized, as it was a quick record of the issues encountered previously and I did not spend much time researching the topic.

Convert HTML String to NSAttributedString

Revisiting this topic, we need to be able to convert the HTML string provided by the API into NSAttributedString and apply the corresponding styles to display it in UITextView/UILabel.

e.g. <b>Test<a>Link</a></b> should be displayed as Test Link

  • Note 1 It is not recommended to use HTML as a communication and rendering medium between the App and data, as the HTML specification is too flexible. The App cannot support all HTML styles, and there is no official HTML conversion rendering engine.
  • Note 2 Starting from iOS 15, you can use the native AttributedString to parse Markdown, or add the apple/swift-markdown Swift Package to parse Markdown.
  • Note 3 Due to the large scale of our company’s project and the long-term use of HTML as a medium, it is temporarily impossible to fully switch to Markdown or other Markup.
  • Note 4 The HTML here is not meant to display an entire HTML web page; it is used as a Markdown-like markup for styling strings. (To render a full page or complex HTML including images and tables, you still need WebView loadHTML.)

It is strongly recommended to use Markdown as the markup language for string rendering. If your project is stuck with HTML like mine and you have no elegant tool to convert HTML to NSAttributedString, give ZMarkupParser a try.

Friends who remember the previous article can directly jump to the ZhgChgLi / ZMarkupParser section.

NSAttributedString.DocumentType.html

The methods for HTML to NSAttributedString found online all suggest directly using NSAttributedString’s built-in options to render HTML, as shown in the example below:

let htmlString = "<b>Test<a>Link</a></b>"
let data = htmlString.data(using: String.Encoding.utf8)!
let attributedOptions:[NSAttributedString.DocumentReadingOptionKey: Any] = [
  .documentType :NSAttributedString.DocumentType.html,
  .characterEncoding: String.Encoding.utf8.rawValue
]
let attributedString = try! NSAttributedString(data: data, options: attributedOptions, documentAttributes: nil)

The problem with this approach:

  • Poor performance: This method uses WebView Core to render the style, then switches back to the Main Thread for UI display; rendering more than 300 characters takes 0.03 seconds.
  • Text loss: For example, marketing copy might use <Congratulation!> which will be treated as an HTML tag and removed.
  • Lack of customization: For example, you cannot specify the boldness level of HTML bold tags in NSAttributedString.
  • Intermittent crashes since iOS 12, with no official solution
  • Frequent crashes on iOS 15; testing showed a 100% crash rate under low battery conditions (fixed in iOS 15.2)
  • Long strings cause crashes; testing showed that strings longer than roughly 54,600 characters crash 100% of the time (EXC_BAD_ACCESS)

The most painful issue for us is the crash problem. From the release of iOS 15 to the fix in 15.2, our app was plagued by this issue. From the data, between 2022/03/11 and 2022/06/08, it caused over 2.4K crashes, affecting over 1.4K users.

This crash issue has existed since iOS 12, and iOS 15 just made it worse. I guess the fix in iOS 15.2 is just a patch, and the official solution cannot completely eradicate it.

The second issue is performance. As a string style Markup Language, it is heavily used in the app’s UILabel/UITextView. As mentioned earlier, one label takes 0.03 seconds, and multiplying this by the number of UILabel/UITextView in a list will cause noticeable lag in user interactions.

XMLParser

The second solution is introduced in the previous article, which uses XMLParser to parse into corresponding NSAttributedString keys and apply styles.

Refer to the implementation of SwiftRichString and the content of the previous article.

The previous article only explored using XMLParser to parse HTML and perform corresponding conversions, completing an experimental implementation, but it did not design it as a well-structured and extensible “tool.”

The problem with this approach:

  • Zero tolerance for errors: <br> / <Congratulation!> / <b>Bold<i>Bold+Italic</b>Italic</i> These three possible HTML scenarios will cause XMLParser to throw an error and display blank.
  • Using XMLParser, the HTML string must fully comply with XML rules, unlike browsers or NSAttributedString.DocumentType.html which can tolerate and display correctly.
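
The strictness described above can be checked with a few lines. This is only a minimal sketch: the helper name isWellFormedXML and the wrapping root element are assumptions made for the demo, not part of the previous article's implementation.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML // XMLParser lives in a separate module on Linux
#endif

// Hypothetical helper: returns true only if the fragment parses as well-formed XML.
// The fragment is wrapped in a root element because XML requires exactly one root.
func isWellFormedXML(_ fragment: String) -> Bool {
    let data = Data("<root>\(fragment)</root>".utf8)
    return XMLParser(data: data).parse()
}

let samples = [
    "<b>Test<a>Link</a></b>",              // valid XML
    "<br>",                                // never closed
    "<b>Bold<i>Bold+Italic</b>Italic</i>"  // misplaced close tags
]
for sample in samples {
    print(sample, "->", isWellFormedXML(sample))
}
```

A browser would render all three samples, but XMLParser should accept only the first, which is why the previous approach ends up displaying blank content.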

Standing on the shoulders of giants

Neither of the above two solutions can perfectly and elegantly solve the HTML problem, so I started searching for existing solutions.

After searching extensively, I found that the results are similar to the projects mentioned above. There are no giants’ shoulders to stand on.

ZhgChgLi/ZMarkupParser

Without the shoulders of giants, I had to become a giant myself, so I developed an HTML String to NSAttributedString tool.

Developed purely in Swift, it parses HTML Tags using Regex and performs Tokenization, analyzing and correcting Tag accuracy (fixing tags without an end & misplaced tags), then converts it into an abstract syntax tree. Finally, using the Visitor Pattern, it maps HTML Tags to abstract styles to get the final NSAttributedString result; it does not rely on any Parser Lib.

Features

  • Supports HTML Render (to NSAttributedString) / Stripper (removing HTML Tags) / Selector functions
  • Higher performance than NSAttributedString.DocumentType.html
  • Automatically analyzes and corrects Tag accuracy (fixing tags without an end & misplaced tags)
  • Supports dynamic style settings from style="color:red..."
  • Supports custom style specifications, such as how bold bold should be
  • Supports flexible extensibility for tags or custom tags and attributes

For detailed introduction, installation, and usage, refer to this article: ZMarkupParser HTML String to NSAttributedString Tool

You can directly git clone the project, then open the ZMarkupParser.xcworkspace Project, select the ZMarkupParser-Demo Target, and directly Build & Run to try it out.

[ZMarkupParser](https://github.com/ZhgChgLi/ZMarkupParser)

Technical Details

Now, let’s dive into the technical details of developing this tool.

Overview of the operation process

The above image shows the general operation process, and the following article will introduce it step by step with code examples.

⚠️ This article will simplify Demo Code as much as possible, reduce abstraction and performance considerations, and focus on explaining the operation principles; for the final result, please refer to the project Source Code.

Code Implementation — Tokenization

a.k.a parser, parsing

When it comes to HTML rendering, the most important part is parsing. In the past, HTML was parsed as XML using XMLParser; however, it couldn’t handle the fact that HTML usage is not 100% XML, causing parser errors and inability to dynamically correct them.

After ruling out the use of XMLParser, the only option left in Swift was to use Regex for matching and parsing.

Initially, the idea was to use Regex to extract “paired” HTML Tags and recursively find HTML Tags layer by layer until the end; however, this couldn’t solve the problem of nested HTML Tags or support for misplaced tags. Therefore, we changed the strategy to extract “single” HTML Tags, recording whether they are Start Tags, Close Tags, or Self-Closing Tags, and combining other strings into a parsed result array.

Tokenization structure is as follows:

enum HTMLParsedResult {
    case start(StartItem) // <a>
    case close(CloseItem) // </a>
    case selfClosing(SelfClosingItem) // <br/>
    case rawString(NSAttributedString)
}

extension HTMLParsedResult {
    class SelfClosingItem {
        let tagName: String
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?
        
        init(tagName: String, tagAttributedString: NSAttributedString, attributes: [String : String]?) {
            self.tagName = tagName
            self.tagAttributedString = tagAttributedString
            self.attributes = attributes
        }
    }
    
    class StartItem {
        let tagName: String
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?

        // Start Tag may be an abnormal HTML Tag or normal text e.g. <Congratulation!>, if found to be an isolated Start Tag after subsequent Normalization, it will be marked as True.
        var isIsolated: Bool = false
        
        init(tagName: String, tagAttributedString: NSAttributedString, attributes: [String : String]?) {
            self.tagName = tagName
            self.tagAttributedString = tagAttributedString
            self.attributes = attributes
        }
        
        // Used for automatic padding correction in subsequent Normalization
        func convertToCloseParsedItem() -> CloseItem {
            return CloseItem(tagName: self.tagName)
        }
        
        // Used for automatic padding correction in subsequent Normalization
        func convertToSelfClosingParsedItem() -> SelfClosingItem {
            return SelfClosingItem(tagName: self.tagName, tagAttributedString: self.tagAttributedString, attributes: self.attributes)
        }
    }
    
    class CloseItem {
        let tagName: String
        init(tagName: String) {
            self.tagName = tagName
        }
    }
}

The regex used is as follows:

<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)

-> Online Regex101 Playground

  • closeTag: matches the / in </a>
  • tagName: matches the a in <a> or </a>
  • tagAttributes: matches href="https://zhgchg.li" style="color:red" in <a href="https://zhgchg.li" style="color:red">
  • selfClosingTag: matches the / in <br/>
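
To see the named groups in action, here is a small sketch that runs the regex above over a string and reports the kind of each tag it finds. The function name describeTags and the empty options set are assumptions made for this demo; the project may configure the expression differently.

```swift
import Foundation

let tagPattern = #"<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)"#
let tagRegex = try! NSRegularExpression(pattern: tagPattern, options: [])

// Hypothetical helper: lists every matched tag as "kind:name".
func describeTags(in html: String) -> [String] {
    let nsString = NSString(string: html)
    let fullRange = NSRange(location: 0, length: nsString.length)
    return tagRegex.matches(in: html, options: [], range: fullRange).map { match -> String in
        // A group that did not participate in the match reports NSNotFound.
        func group(_ name: String) -> String {
            let range = match.range(withName: name)
            return range.location == NSNotFound ? "" : nsString.substring(with: range)
        }
        let kind: String
        if group("closeTag") == "/" {
            kind = "close"
        } else if group("selfClosingTag") == "/" {
            kind = "selfClosing"
        } else {
            kind = "start"
        }
        return "\(kind):\(group("tagName").lowercased())"
    }
}

print(describeTags(in: "<a href=\"https://zhgchg.li\">Link</a><br/>"))
```

Note that the regex only classifies single tags; pairing start and close tags is left to the stack logic in the tokenization step.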

*This regex can still be optimized, will do it later.

Additional information about regex is provided in the latter part of the article, interested friends can refer to it.

Combining it all together:

import Foundation

// Regex from the section above
let pattern = #"<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)"#
// Assumption for this demo; refer to the source code for the options used in the project
let expressionOptions: NSRegularExpression.Options = []

var tokenizationResult: [HTMLParsedResult] = []

// The pattern is a constant, so a failure here is a programmer error
let expression = try! NSRegularExpression(pattern: pattern, options: expressionOptions)
let attributedString = NSAttributedString(string: "<a href=\"https://zhgchg.li\">Li<b>nk</a>Bold</b>")
let totalLength = attributedString.string.utf16.count // NSRange counts UTF-16 code units, so emoji are handled correctly
var lastMatch: NSTextCheckingResult?

// Start Tags Stack, First In Last Out (FILO)
// Used to check whether the HTML string needs subsequent normalization to correct misplaced tags or add self-closing tags
var stackStartItems: [HTMLParsedResult.StartItem] = []
var needFormatter: Bool = false

// Returns the text captured by a named group, or nil if the group did not participate in the match
func capturedString(_ match: NSTextCheckingResult, _ name: String) -> NSAttributedString? {
  let range = match.range(withName: name)
  guard range.location != NSNotFound, range.length > 0 else { return nil }
  return attributedString.attributedSubstring(from: range)
}

expression.enumerateMatches(in: attributedString.string, range: NSMakeRange(0, totalLength)) { match, _, _ in
  if let match = match {
    // Check the string between tags or before the first tag
    // e.g. Test<a>Link</a>zzz<b>bold</b>Test2 -> Test,zzz
    let lastMatchEnd = lastMatch?.range.upperBound ?? 0
    let currentMatchStart = match.range.lowerBound
    if currentMatchStart > lastMatchEnd {
      let rawStringBetweenTag = attributedString.attributedSubstring(from: NSMakeRange(lastMatchEnd, (currentMatchStart - lastMatchEnd)))
      tokenizationResult.append(.rawString(rawStringBetweenTag))
    }

    // <a href="https://zhgchg.li">, </a>
    let matchAttributedString = attributedString.attributedSubstring(from: match.range)
    // a, a
    let matchTag = capturedString(match, "tagName")?.string.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
    // false, true
    let matchIsEndTag = capturedString(match, "closeTag")?.string.trimmingCharacters(in: .whitespacesAndNewlines) == "/"
    // href="https://zhgchg.li", nil
    // parseAttributes uses regex to further extract HTML attributes into [String: String]; refer to the source code
    let matchTagAttributes = parseAttributes(capturedString(match, "tagAttributes"))
    // false, false
    let matchIsSelfClosingTag = capturedString(match, "selfClosingTag")?.string.trimmingCharacters(in: .whitespacesAndNewlines) == "/"

    if let matchTag = matchTag {
        if matchIsSelfClosingTag {
          // e.g. <br/>
          tokenizationResult.append(.selfClosing(.init(tagName: matchTag, tagAttributedString: matchAttributedString, attributes: matchTagAttributes)))
        } else {
          // e.g. <a> or </a>
          if matchIsEndTag {
            // e.g. </a>
            // Find the position of the same tag name in the stack, searching from the last element
            if let index = stackStartItems.lastIndex(where: { $0.tagName == matchTag }) {
              // If it is not the last one, there is a misplaced tag or a missing close tag
              if index != stackStartItems.count - 1 {
                  needFormatter = true
              }
              tokenizationResult.append(.close(.init(tagName: matchTag)))
              stackStartItems.remove(at: index)
            } else {
              // Extra close tag e.g. </a>
              // Does not affect subsequent processing, just ignore
            }
          } else {
            // e.g. <a>
            let startItem: HTMLParsedResult.StartItem = HTMLParsedResult.StartItem(tagName: matchTag, tagAttributedString: matchAttributedString, attributes: matchTagAttributes)
            tokenizationResult.append(.start(startItem))
            // Add to stack
            stackStartItems.append(startItem)
          }
        }
     }

    lastMatch = match
  }
}

// Check the trailing raw string
// e.g. Test<a>Link</a>Test2 -> Test2
if let lastMatch = lastMatch {
  let currentIndex = lastMatch.range.upperBound
  if totalLength > currentIndex {
    // There is remaining text after the last tag
    let restString = attributedString.attributedSubstring(from: NSMakeRange(currentIndex, (totalLength - currentIndex)))
    tokenizationResult.append(.rawString(restString))
  }
} else {
  // lastMatch = nil, meaning no tags were found, everything is plain text
  let restString = attributedString.attributedSubstring(from: NSMakeRange(0, totalLength))
  tokenizationResult.append(.rawString(restString))
}

// Check if the stack is empty; if not, there are start tags without corresponding close tags
// Mark them as isolated start tags
for stackStartItem in stackStartItems {
  stackStartItem.isIsolated = true
  needFormatter = true
}

print(tokenizationResult)
// [
//    .start("a",["href":"https://zhgchg.li"])
//    .rawString("Li")
//    .start("b",nil)
//    .rawString("nk")
//    .close("a")
//    .rawString("Bold")
//    .close("b")
// ]

Operation flow as shown in the figure

The final result will be an array of Tokenization results.

Corresponding source code in HTMLStringToParsedResultProcessor.swift implementation

Normalization

a.k.a Formatter, normalization

After obtaining the preliminary parsing results in the previous step, if it is found during parsing that further normalization is needed, this step is required to automatically correct HTML Tag issues.

There are three types of HTML Tag issues:

  • HTML Tag but missing Close Tag: e.g., <br>
  • General text mistaken as HTML Tag: e.g., <Congratulation!>
  • HTML Tag misalignment issues: e.g., <a>Li<b>nk</a>Bold</b>

The correction method is also very simple. We need to traverse the elements of the Tokenization results and try to fill in the gaps.
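
To make the idea concrete before looking at the full version, here is a simplified sketch of the misplaced-tag correction working on plain tag names. The Token enum and the normalize function are hypothetical names for this demo; unlike the project code, this variant builds a new array instead of inserting in place, and it does not handle isolated start tags.

```swift
enum Token: Equatable {
    case start(String)
    case close(String)
    case text(String)
}

func normalize(_ tokens: [Token]) -> [Token] {
    var result: [Token] = []
    var openTags: [String] = [] // stack of currently open tag names

    for token in tokens {
        switch token {
        case .start(let name):
            openTags.append(name)
            result.append(token)
        case .text:
            result.append(token)
        case .close(let name):
            // A close tag without a matching start tag is ignored.
            guard let matchIndex = openTags.lastIndex(of: name) else { continue }
            // Tags opened after the matching start tag are still open.
            let interrupted = Array(openTags[(matchIndex + 1)...])
            // Close them first, innermost tag first...
            for tag in interrupted.reversed() { result.append(.close(tag)) }
            result.append(.close(name))
            // ...then reopen them so the following text keeps their style.
            for tag in interrupted { result.append(.start(tag)) }
            openTags.removeSubrange(matchIndex...)
            openTags.append(contentsOf: interrupted)
        }
    }
    return result
}

// <a>Li<b>nk</a>Bold</b>
let misplaced: [Token] = [
    .start("a"), .text("Li"), .start("b"), .text("nk"),
    .close("a"), .text("Bold"), .close("b")
]
print(normalize(misplaced))
```

For the misplaced sample, the result corresponds to <a>Li<b>nk</b></a><b>Bold</b>, which is the same shape the full implementation produces.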

Operation flow as shown in the figure

var normalizationResult = tokenizationResult

// Start Tags Stack, First In Last Out (FILO)
var stackExpectedStartItems: [HTMLParsedResult.StartItem] = []
var itemIndex = 0
while itemIndex < normalizationResult.count {
    switch normalizationResult[itemIndex] {
    case .start(let item):
        if item.isIsolated {
            // If it is an isolated Start Tag
            if WC3HTMLTagName(rawValue: item.tagName) == nil && (item.attributes?.isEmpty ?? true) {
                // If it is not a W3C-defined HTML Tag & has no HTML Attribute
                // (for the WC3HTMLTagName enum, refer to the source code)
                // treat it as general text mistaken for an HTML Tag
                // and change it to a raw string
                normalizationResult[itemIndex] = .rawString(item.tagAttributedString)
            } else {
                // Otherwise, change it to a self-closing tag, e.g., <br> -> <br/>
                normalizationResult[itemIndex] = .selfClosing(item.convertToSelfClosingParsedItem())
            }
            itemIndex += 1
        } else {
            // Normal Start Tag, add to Stack
            stackExpectedStartItems.append(item)
            itemIndex += 1
        }
    case .close(let item):
        // Encountered a Close Tag
        // Get the Tags between the matching Start Tag in the Stack and this Close Tag
        // e.g., <a><u><b>[CurrentIndex]</b></u></a> -> interval 0
        // e.g., <a><u><b>[CurrentIndex]</a></u></b> -> interval b,u

        let reversedStackExpectedStartItems = Array(stackExpectedStartItems.reversed())
        guard let reversedStackExpectedStartItemsOccurredIndex = reversedStackExpectedStartItems.firstIndex(where: { $0.tagName == item.tagName }) else {
            itemIndex += 1
            continue
        }
        
        let reversedStackExpectedStartItemsOccurred = Array(reversedStackExpectedStartItems.prefix(upTo: reversedStackExpectedStartItemsOccurredIndex))
        
        // Interval 0 means there is no misplaced tag
        guard reversedStackExpectedStartItemsOccurred.count != 0 else {
            // It is a pair, pop
            stackExpectedStartItems.removeLast()
            itemIndex += 1
            continue
        }
        
        // There are interval Tags: close them before this Close Tag and reopen them after it
        // e.g., <a><u><b>[CurrentIndex]</a></u></b> ->
        // e.g., <a><u><b>[CurrentIndex]</b></u></a><u><b></u></b>
        let stackExpectedStartItemsOccurred = Array(reversedStackExpectedStartItemsOccurred.reversed())
        let afterItems = stackExpectedStartItemsOccurred.map({ HTMLParsedResult.start($0) })
        let beforeItems = reversedStackExpectedStartItemsOccurred.map({ HTMLParsedResult.close($0.convertToCloseParsedItem()) })
        normalizationResult.insert(contentsOf: afterItems, at: normalizationResult.index(after: itemIndex))
        normalizationResult.insert(contentsOf: beforeItems, at: itemIndex)
        
        // Continue from the first reopened Start Tag so it is pushed onto the Stack again
        itemIndex = normalizationResult.index(after: itemIndex) + stackExpectedStartItemsOccurred.count
        
        // Update the Start Tags Stack: remove the matched tag and the interval tags
        // e.g., removes a,u,b
        stackExpectedStartItems.removeAll { startItem in
            return reversedStackExpectedStartItems.prefix(through: reversedStackExpectedStartItemsOccurredIndex).contains(where: { $0 === startItem })
        }
    case .selfClosing, .rawString:
        itemIndex += 1
    }
}

print(normalizationResult)
// [
//    .start("a",["href":"https://zhgchg.li"])
//    .rawString("Li")
//    .start("b",nil)
//    .rawString("nk")
//    .close("b")
//    .close("a")
//    .start("b",nil)
//    .rawString("Bold")
//    .close("b")
// ]

Corresponding implementation in the source code HTMLParsedResultFormatterProcessor.swift

Abstract Syntax Tree

a.k.a AST, Abstract Tree

After the Tokenization & Normalization data preprocessing is completed, the result needs to be converted into an abstract tree 🌲.

As shown in the figure

Converting into an abstract tree facilitates our future operations and extensions, such as implementing Selector functionality or other conversions like HTML to Markdown; or if we want to add Markdown to NSAttributedString in the future, we only need to implement Markdown’s Tokenization & Normalization to complete it.

First, we define a Markup Protocol with Child & Parent properties to record the information of leaves and branches:

protocol Markup: AnyObject {
    var parentMarkup: Markup? { get set }
    var childMarkups: [Markup] { get set }
    
    func appendChild(markup: Markup)
    func prependChild(markup: Markup)
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result
}

extension Markup {
    func appendChild(markup: Markup) {
        markup.parentMarkup = self
        childMarkups.append(markup)
    }
    
    func prependChild(markup: Markup) {
        markup.parentMarkup = self
        childMarkups.insert(markup, at: 0)
    }
}

Additionally, using the Visitor Pattern, each style attribute is defined as an object Element, and different Visit strategies are used to obtain individual application results.

protocol MarkupVisitor {
    associatedtype Result
        
    func visit(markup: Markup) -> Result
    
    func visit(_ markup: RootMarkup) -> Result
    func visit(_ markup: RawStringMarkup) -> Result
    
    func visit(_ markup: BoldMarkup) -> Result
    func visit(_ markup: LinkMarkup) -> Result
    //...
}

extension MarkupVisitor {
    func visit(markup: Markup) -> Result {
        return markup.accept(self)
    }
}

Basic Markup nodes:

// Root node
final class RootMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    
    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

// Leaf node
final class RawStringMarkup: Markup {
    let attributedString: NSAttributedString
    
    init(attributedString: NSAttributedString) {
        self.attributedString = attributedString
    }
    
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    
    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

Define Markup Style Nodes:

// Branch nodes:

// Link style
final class LinkMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    
    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

// Bold style
final class BoldMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    
    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

Corresponding implementation in the source code Markup
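
To show how the pieces above fit together, here is a compact, self-contained sketch that builds the tree for <b>Test<a>Link</a></b> by hand and walks it with a visitor. PlainTextVisitor is a hypothetical visitor written for this demo (it simply concatenates the raw strings); the protocols are trimmed-down copies of the ones above.

```swift
import Foundation

protocol Markup: AnyObject {
    var parentMarkup: Markup? { get set }
    var childMarkups: [Markup] { get set }
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result
}

extension Markup {
    func appendChild(markup: Markup) {
        markup.parentMarkup = self
        childMarkups.append(markup)
    }
}

protocol MarkupVisitor {
    associatedtype Result
    func visit(_ markup: RootMarkup) -> Result
    func visit(_ markup: RawStringMarkup) -> Result
    func visit(_ markup: BoldMarkup) -> Result
    func visit(_ markup: LinkMarkup) -> Result
}

final class RootMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class RawStringMarkup: Markup {
    let attributedString: NSAttributedString
    init(attributedString: NSAttributedString) { self.attributedString = attributedString }
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class BoldMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class LinkMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

// Hypothetical visitor: leaf nodes return their text, branch nodes join their children.
struct PlainTextVisitor: MarkupVisitor {
    typealias Result = String
    private func joinChildren(of markup: Markup) -> String {
        markup.childMarkups.map { $0.accept(self) }.joined()
    }
    func visit(_ markup: RootMarkup) -> String { joinChildren(of: markup) }
    func visit(_ markup: RawStringMarkup) -> String { markup.attributedString.string }
    func visit(_ markup: BoldMarkup) -> String { joinChildren(of: markup) }
    func visit(_ markup: LinkMarkup) -> String { joinChildren(of: markup) }
}

// Tree for <b>Test<a>Link</a></b>
let demoRoot = RootMarkup()
let demoBold = BoldMarkup()
let demoLink = LinkMarkup()
demoRoot.appendChild(markup: demoBold)
demoBold.appendChild(markup: RawStringMarkup(attributedString: NSAttributedString(string: "Test")))
demoBold.appendChild(markup: demoLink)
demoLink.appendChild(markup: RawStringMarkup(attributedString: NSAttributedString(string: "Link")))

let plainText = demoRoot.accept(PlainTextVisitor())
print(plainText)
```

Adding another output format then only requires a new visitor type; the node classes stay untouched.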

Before converting to an abstract tree, we also need…

MarkupComponent

Our tree structure itself carries no data (for example, a LinkMarkup node needs its URL in order to be rendered later), so we define a container that associates a tree node with its related data:

protocol MarkupComponent {
    associatedtype T
    var markup: Markup { get }
    var value: T { get }
    
    init(markup: Markup, value: T)
}

extension Sequence where Iterator.Element: MarkupComponent {
    func value(markup: Markup) -> Element.T? {
        return self.first(where:{ $0.markup === markup })?.value as? Element.T
    }
}

Corresponding implementation in the source code MarkupComponent

You can also declare Markup as Hashable and directly use Dictionary to store values [Markup: Any], but in this way, Markup cannot be used as a general type and needs to be prefixed with any Markup.
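
As a usage sketch of the container approach, the lookup is done by object identity (===) rather than by value. URLComponent below is a hypothetical component type for this demo, and the Markup protocol is reduced to an empty class-bound protocol to keep the example short.

```swift
protocol Markup: AnyObject {}
final class LinkMarkup: Markup {}
final class BoldMarkup: Markup {}

protocol MarkupComponent {
    associatedtype T
    var markup: Markup { get }
    var value: T { get }
    init(markup: Markup, value: T)
}

extension Sequence where Element: MarkupComponent {
    // Finds the value stored for exactly this node instance.
    func value(markup: Markup) -> Element.T? {
        first(where: { $0.markup === markup })?.value
    }
}

// Hypothetical component that attaches a URL string to a node.
struct URLComponent: MarkupComponent {
    typealias T = String
    let markup: Markup
    let value: String
    init(markup: Markup, value: String) {
        self.markup = markup
        self.value = value
    }
}

let linkNode = LinkMarkup()
let boldNode = BoldMarkup()
let urlComponents = [URLComponent(markup: linkNode, value: "https://zhgchg.li")]

print(urlComponents.value(markup: linkNode) ?? "nil") // value stored for linkNode
print(urlComponents.value(markup: boldNode) ?? "nil") // no component was stored for boldNode
```

The linear search is acceptable for a demo; for large trees a dictionary keyed by ObjectIdentifier would be one possible alternative.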

HTMLTag & HTMLTagName & HTMLTagNameVisitor

We also abstracted the HTML Tag Name part, allowing users to decide which tags need to be processed and facilitating future extensions. For example, the <strong> Tag Name can correspond to BoldMarkup.

public protocol HTMLTagName {
    var string: String { get }
    func accept<V: HTMLTagNameVisitor>(_ visitor: V) -> V.Result
}

public struct A_HTMLTagName: HTMLTagName {
    public let string: String = WC3HTMLTagName.a.rawValue
    
    public init() {
        
    }
    
    public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
        return visitor.visit(self)
    }
}

public struct B_HTMLTagName: HTMLTagName {
    public let string: String = WC3HTMLTagName.b.rawValue
    
    public init() {
        
    }
    
    public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
        return visitor.visit(self)
    }
}

Corresponding implementation in the source code HTMLTagNameVisitor

Additionally, refer to the W3C wiki which lists the HTML tag name enum: WC3HTMLTagName.swift

HTMLTag is simply a container object because we want to allow external specification of the style corresponding to the HTML Tag, so we declare a container to put them together:

struct HTMLTag {
    let tagName: HTMLTagName
    let customStyle: MarkupStyle? // Render will be explained later
    
    init(tagName: HTMLTagName, customStyle: MarkupStyle? = nil) {
        self.tagName = tagName
        self.customStyle = customStyle
    }
}

Corresponding implementation in the source code HTMLTag

HTMLTagNameToHTMLMarkupVisitor

struct HTMLTagNameToMarkupVisitor: HTMLTagNameVisitor {
    typealias Result = Markup
    
    let attributes: [String: String]?
    
    func visit(_ tagName: A_HTMLTagName) -> Result {
        return LinkMarkup()
    }
    
    func visit(_ tagName: B_HTMLTagName) -> Result {
        return BoldMarkup()
    }
    //...
}

Corresponding implementation in the source code HTMLTagNameToHTMLMarkupVisitor
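
The double dispatch between tag names and visitors can be tried in isolation with the following sketch. The HTMLTagNameVisitor protocol shown here is a guess at its minimal shape, and MarkupKindVisitor is a hypothetical visitor that returns a description string instead of a real Markup node.

```swift
// Assumed minimal shape of the visitor protocol, for illustration only.
protocol HTMLTagNameVisitor {
    associatedtype Result
    func visit(_ tagName: A_HTMLTagName) -> Result
    func visit(_ tagName: B_HTMLTagName) -> Result
}

protocol HTMLTagName {
    var string: String { get }
    func accept<V: HTMLTagNameVisitor>(_ visitor: V) -> V.Result
}

struct A_HTMLTagName: HTMLTagName {
    let string: String = "a"
    func accept<V: HTMLTagNameVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

struct B_HTMLTagName: HTMLTagName {
    let string: String = "b"
    func accept<V: HTMLTagNameVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

// Hypothetical visitor: names the Markup type each tag would map to.
struct MarkupKindVisitor: HTMLTagNameVisitor {
    typealias Result = String
    func visit(_ tagName: A_HTMLTagName) -> String { "LinkMarkup" }
    func visit(_ tagName: B_HTMLTagName) -> String { "BoldMarkup" }
}

let demoTagNames: [HTMLTagName] = [A_HTMLTagName(), B_HTMLTagName()]
let markupKinds = demoTagNames.map { $0.accept(MarkupKindVisitor()) }
print(markupKinds)
```

Each tag name picks the matching visit overload itself, so the caller never needs a switch over tag name strings.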

Convert to Abstract Tree with HTML Data

We need to convert the result of the normalized HTML data into an abstract tree. First, declare a MarkupComponent data structure that can store HTML data:

struct HTMLElementMarkupComponent: MarkupComponent {
    struct HTMLElement {
        let tag: HTMLTag
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?
    }
    
    typealias T = HTMLElement
    
    let markup: Markup
    let value: HTMLElement
    init(markup: Markup, value: HTMLElement) {
        self.markup = markup
        self.value = value
    }
}

Convert to Markup Abstract Tree:

var htmlElementComponents: [HTMLElementMarkupComponent] = []
let rootMarkup = RootMarkup()
var currentMarkup: Markup = rootMarkup

let htmlTags: [String: HTMLTag]
init(htmlTags: [HTMLTag]) {
  self.htmlTags = Dictionary(uniqueKeysWithValues: htmlTags.map{ ($0.tagName.string, $0) })
}

// Start Tags Stack, ensure correct pop tag
// Normalization has already been done before, it should not go wrong, just to ensure
var stackExpectedStartItems: [HTMLParsedResult.StartItem] = []
for thisItem in from {
    switch thisItem {
    case .start(let item):
        let visitor = HTMLTagNameToMarkupVisitor(attributes: item.attributes)
        let htmlTag = self.htmlTags[item.tagName] ?? HTMLTag(tagName: ExtendTagName(item.tagName))
        // Use Visitor to ask for the corresponding Markup
        let markup = visitor.visit(tagName: htmlTag.tagName)
        
        // Append the new markup as a child of the current branch,
        // then make it the current branch
        htmlElementComponents.append(.init(markup: markup, value: .init(tag: htmlTag, tagAttributedString: item.tagAttributedString, attributes: item.attributes)))
        currentMarkup.appendChild(markup: markup)
        currentMarkup = markup
        
        stackExpectedStartItems.append(item)
    case .selfClosing(let item):
        // Directly add to the current branch's leaf node
        let visitor = HTMLTagNameToMarkupVisitor(attributes: item.attributes)
        let htmlTag = self.htmlTags[item.tagName] ?? HTMLTag(tagName: ExtendTagName(item.tagName))
        let markup = visitor.visit(tagName: htmlTag.tagName)
        htmlElementComponents.append(.init(markup: markup, value: .init(tag: htmlTag, tagAttributedString: item.tagAttributedString, attributes: item.attributes)))
        currentMarkup.appendChild(markup: markup)
    case .close(let item):
        if let lastTagName = stackExpectedStartItems.popLast()?.tagName,
           lastTagName == item.tagName {
            // When encountering Close Tag, return to the previous level
            currentMarkup = currentMarkup.parentMarkup ?? currentMarkup
        }
    case .rawString(let attributedString):
        // Directly add to the current branch's leaf node
        currentMarkup.appendChild(markup: RawStringMarkup(attributedString: attributedString))
    }
}

// print(htmlElementComponents)
// [(markup: LinkMarkup, (tag: a, attributes: ["href":"https://zhgchg.li"]...)]

Operation result as shown in the figure

Corresponding source code implementation in HTMLParsedResultToHTMLElementWithRootMarkupProcessor.swift

At this point, we have actually completed the functionality of the Selector 🎉

public class HTMLSelector: CustomStringConvertible {
    
    let markup: Markup
    let components: [HTMLElementMarkupComponent]
    init(markup: Markup, components: [HTMLElementMarkupComponent]) {
        self.markup = markup
        self.components = components
    }
    
    public func filter(_ htmlTagName: String) -> [HTMLSelector] {
        let result = markup.childMarkups.filter({ components.value(markup: $0)?.tag.tagName.isEqualTo(htmlTagName) ?? false })
        return result.map({ .init(markup: $0, components: components) })
    }

    //...
}

We can filter child nodes layer by layer.

Corresponding source code implementation in HTMLSelector

Parser — HTML to MarkupStyle (Abstract of NSAttributedString.Key)

Next, we need to complete the conversion of HTML to MarkupStyle (NSAttributedString.Key).

NSAttributedString sets the text style through NSAttributedString.Key Attributes. We abstract all fields of NSAttributedString.Key to correspond to MarkupStyle, MarkupStyleColor, MarkupStyleFont, MarkupStyleParagraphStyle.

Purpose:

  • The original data structure of Attributes is [NSAttributedString.Key: Any?]. If exposed directly, it is difficult to control the values users input, and incorrect values may cause crashes, such as .font: 123.
  • Styles need to be inheritable. For example, in <a><b>test</b></a> the string test is bold and also inherits the link style from its parent (bold + link); if the Dictionary were exposed directly, the inheritance rules would be hard to control.
  • Encapsulate iOS/macOS (UIKit/AppKit) related objects.

MarkupStyle Struct

public struct MarkupStyle {
    public var font:MarkupStyleFont
    public var paragraphStyle:MarkupStyleParagraphStyle
    public var foregroundColor:MarkupStyleColor? = nil
    public var backgroundColor:MarkupStyleColor? = nil
    public var ligature:NSNumber? = nil
    public var kern:NSNumber? = nil
    public var tracking:NSNumber? = nil
    public var strikethroughStyle:NSUnderlineStyle? = nil
    public var underlineStyle:NSUnderlineStyle? = nil
    public var strokeColor:MarkupStyleColor? = nil
    public var strokeWidth:NSNumber? = nil
    public var shadow:NSShadow? = nil
    public var textEffect:String? = nil
    public var attachment:NSTextAttachment? = nil
    public var link:URL? = nil
    public var baselineOffset:NSNumber? = nil
    public var underlineColor:MarkupStyleColor? = nil
    public var strikethroughColor:MarkupStyleColor? = nil
    public var obliqueness:NSNumber? = nil
    public var expansion:NSNumber? = nil
    public var writingDirection:NSNumber? = nil
    public var verticalGlyphForm:NSNumber? = nil
    //...

    // Inherited from...
    // Default: When the field is nil, fill in the current data object from 'from'
    mutating func fillIfNil(from: MarkupStyle?) {
        guard let from = from else { return }
        
        var currentFont = self.font
        currentFont.fillIfNil(from: from.font)
        self.font = currentFont
        
        var currentParagraphStyle = self.paragraphStyle
        currentParagraphStyle.fillIfNil(from: from.paragraphStyle)
        self.paragraphStyle = currentParagraphStyle
        //..
    }

    // MarkupStyle to NSAttributedString.Key: Any
    func render() -> [NSAttributedString.Key: Any] {
        var data: [NSAttributedString.Key: Any] = [:]
        
        if let font = font.getFont() {
            data[.font] = font
        }

        if let ligature = self.ligature {
            data[.ligature] = ligature
        }
        //...
        return data
    }
}

public struct MarkupStyleFont: MarkupStyleItem {
    public enum FontWeight {
        case style(FontWeightStyle)
        case rawValue(CGFloat)
    }
    public enum FontWeightStyle: String {
        case ultraLight, light, thin, regular, medium, semibold, bold, heavy, black
        // ...
    }
    
    public var size: CGFloat?
    public var weight: FontWeight?
    public var italic: Bool?
    //...
}

public struct MarkupStyleParagraphStyle: MarkupStyleItem {
    public var lineSpacing:CGFloat? = nil
    public var paragraphSpacing:CGFloat? = nil
    public var alignment:NSTextAlignment? = nil
    public var headIndent:CGFloat? = nil
    public var tailIndent:CGFloat? = nil
    public var firstLineHeadIndent:CGFloat? = nil
    public var minimumLineHeight:CGFloat? = nil
    public var maximumLineHeight:CGFloat? = nil
    public var lineBreakMode:NSLineBreakMode? = nil
    public var baseWritingDirection:NSWritingDirection? = nil
    public var lineHeightMultiple:CGFloat? = nil
    public var paragraphSpacingBefore:CGFloat? = nil
    public var hyphenationFactor:Float? = nil
    public var usesDefaultHyphenation:Bool? = nil
    public var tabStops: [NSTextTab]? = nil
    public var defaultTabInterval:CGFloat? = nil
    public var textLists: [NSTextList]? = nil
    public var allowsDefaultTighteningForTruncation:Bool? = nil
    public var lineBreakStrategy: NSParagraphStyle.LineBreakStrategy? = nil
    //...
}

public struct MarkupStyleColor {
    let red: Int
    let green: Int
    let blue: Int
    let alpha: CGFloat
    //...
}

Corresponding implementation in the source code MarkupStyle
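
The inheritance rule behind fillIfNil can be shown with a trimmed-down sketch. MiniStyle below is a hypothetical three-field struct made up for this illustration (the real MarkupStyle has many more fields and nested font and paragraph styles); it only assumes the rule stated in the comment above: a field that is already set is kept, and a nil field is copied from the parent.

```swift
// Hypothetical, trimmed-down style with only three fields; the real MarkupStyle has many more.
struct MiniStyle: Equatable {
    var isBold: Bool? = nil
    var link: String? = nil
    var fontSize: Double? = nil

    // Keep every field that is already set; copy the rest from `from`.
    mutating func fillIfNil(from: MiniStyle?) {
        guard let from = from else { return }
        isBold = isBold ?? from.isBold
        link = link ?? from.link
        fontSize = fontSize ?? from.fontSize
    }
}

// <a><b>test</b></a>: start from the innermost tag and walk outward.
var inheritedStyle = MiniStyle(isBold: true)                            // <b>
inheritedStyle.fillIfNil(from: MiniStyle(link: "https://zhgchg.li"))    // <a>
inheritedStyle.fillIfNil(from: MiniStyle(isBold: false, fontSize: 13))  // root style
```

The innermost value wins: isBold stays true even though the root style says false, while link and fontSize are filled in from the outer layers.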

Additionally, following the W3C wiki's list of browser-predefined color names, MarkupStyleColorName.swift enumerates each color name together with its R, G, B values.

HTMLTagStyleAttribute & HTMLTagStyleAttributeVisitor

These two objects deserve a little more explanation. HTML tags can also be styled with CSS through the style attribute, so we apply the same abstraction used for HTMLTagName once more to HTML style attributes.

For example, the HTML might contain <a style="color:red;font-size:14px">RedLink</a>, which means this link should be red with a size of 14px.
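
As a rough sketch of how such a style value could be broken into declarations before each one is matched against a style attribute: parseInlineStyle is a hypothetical helper written for this illustration, not a function from ZMarkupParser, and it deliberately ignores the finer points of CSS parsing.

```swift
import Foundation

// Hypothetical helper: split "color:red; font-size:14px" into trimmed (name, value) pairs.
func parseInlineStyle(_ style: String) -> [(name: String, value: String)] {
    return style.split(separator: ";").compactMap { declaration -> (name: String, value: String)? in
        // maxSplits: 1 keeps any later ":" (for example inside a URL) in the value.
        let parts = declaration.split(separator: ":", maxSplits: 1)
        guard parts.count == 2 else { return nil }
        let name = parts[0].trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
        let value = parts[1].trimmingCharacters(in: .whitespacesAndNewlines)
        guard !name.isEmpty, !value.isEmpty else { return nil }
        return (name: name, value: value)
    }
}

let declarations = parseInlineStyle("color:red; font-size:14px;")
```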

public protocol HTMLTagStyleAttribute {
    var styleName: String { get }
    
    func accept<V: HTMLTagStyleAttributeVisitor>(_ visitor: V) -> V.Result
}

public protocol HTMLTagStyleAttributeVisitor {
    associatedtype Result
    
    func visit(styleAttribute: HTMLTagStyleAttribute) -> Result
    func visit(_ styleAttribute: ColorHTMLTagStyleAttribute) -> Result
    func visit(_ styleAttribute: FontSizeHTMLTagStyleAttribute) -> Result
    //...
}

public extension HTMLTagStyleAttributeVisitor {
    func visit(styleAttribute: HTMLTagStyleAttribute) -> Result {
        return styleAttribute.accept(self)
    }
}

Corresponding implementation in the source code HTMLTagStyleAttribute

HTMLTagStyleAttributeToMarkupStyleVisitor

struct HTMLTagStyleAttributeToMarkupStyleVisitor: HTMLTagStyleAttributeVisitor {
    typealias Result = MarkupStyle?
    
    let value: String
    
    func visit(_ styleAttribute: ColorHTMLTagStyleAttribute) -> Result {
        // Regex to extract Color Hex or Mapping from HTML Pre-defined Color Name, please refer to the Source Code
        guard let color = MarkupStyleColor(string: value) else { return nil }
        return MarkupStyle(foregroundColor: color)
    }
    
    func visit(_ styleAttribute: FontSizeHTMLTagStyleAttribute) -> Result {
        // Regex to extract 10px -> 10, please refer to the Source Code
        guard let size = self.convert(fromPX: value) else { return nil }
        return MarkupStyle(font: MarkupStyleFont(size: CGFloat(size)))
    }
    // ...
}

Corresponding implementation in the source code HTMLTagAttributeToMarkupStyleVisitor.swift

The value passed to init is the attribute's value; each visit overload converts it into the corresponding MarkupStyle field.
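
The two conversions mentioned in the comments above (10px -> 10, and a hex color string to R, G, B) might look roughly like this. Both convertPX and parseHexColor are hypothetical helpers written for this sketch; the library's own regex and its handling of predefined color names live in the source code and may differ.

```swift
import Foundation

// Hypothetical helper: extract the number from values such as "14px" or " 12.5 px ".
func convertPX(_ value: String) -> Double? {
    let pattern = "^\\s*([0-9]+(?:\\.[0-9]+)?)\\s*(?:px)?\\s*$"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return nil
    }
    let range = NSRange(value.startIndex..<value.endIndex, in: value)
    guard let match = regex.firstMatch(in: value, options: [], range: range),
          let numberRange = Range(match.range(at: 1), in: value) else {
        return nil
    }
    return Double(value[numberRange])
}

// Hypothetical helper: parse "#RRGGBB" into 0...255 components.
func parseHexColor(_ value: String) -> (red: Int, green: Int, blue: Int)? {
    let trimmed = value.trimmingCharacters(in: .whitespacesAndNewlines)
    let digits = trimmed.dropFirst()
    guard trimmed.hasPrefix("#"), digits.count == 6,
          digits.allSatisfy({ $0.isHexDigit }),
          let rgb = Int(digits, radix: 16) else {
        return nil
    }
    return (red: (rgb >> 16) & 0xFF, green: (rgb >> 8) & 0xFF, blue: rgb & 0xFF)
}
```

Returning nil for anything unrecognized mirrors the visitor above, where a failed conversion simply produces no MarkupStyle.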

HTMLElementMarkupComponentMarkupStyleVisitor

After introducing the MarkupStyle object, we need to convert the result of Normalization’s HTMLElementComponents into MarkupStyle.

// MarkupStyle policy
public enum MarkupStylePolicy {
    case respectMarkupStyleFromCode // Prioritize from Code, fill in with HTML Style Attribute
    case respectMarkupStyleFromHTMLStyleAttribute // Prioritize from HTML Style Attribute, fill in with Code
}

struct HTMLElementMarkupComponentMarkupStyleVisitor: MarkupVisitor {

    typealias Result = MarkupStyle?
    
    let policy: MarkupStylePolicy
    let components: [HTMLElementMarkupComponent]
    let styleAttributes: [HTMLTagStyleAttribute]

    func visit(_ markup: BoldMarkup) -> Result {
        // .bold is just a default style defined in MarkupStyle, please refer to the Source Code
        return defaultVisit(components.value(markup: markup), defaultStyle: .bold)
    }
    
    func visit(_ markup: LinkMarkup) -> Result {
        // .link is just a default style defined in MarkupStyle, please refer to the Source Code
        var markupStyle = defaultVisit(components.value(markup: markup), defaultStyle: .link) ?? .link
        
        // Get the HtmlElement corresponding to LinkMarkup from HtmlElementComponents
        // Find the href parameter from the attributes of HtmlElement (HTML carries URL String)
        if let href = components.value(markup: markup)?.attributes?["href"] as? String,
           let url = URL(string: href) {
            markupStyle.link = url
        }
        return markupStyle
    }

    // ...
}

extension HTMLElementMarkupComponentMarkupStyleVisitor {
    // Get the custom MarkupStyle specified in the HTMLTag container
    private func customStyle(_ htmlElement: HTMLElementMarkupComponent.HTMLElement?) -> MarkupStyle? {
        guard let customStyle = htmlElement?.tag.customStyle else {
            return nil
        }
        return customStyle
    }
    
    // Default action
    func defaultVisit(_ htmlElement: HTMLElementMarkupComponent.HTMLElement?, defaultStyle: MarkupStyle? = nil) -> Result {
        var markupStyle: MarkupStyle? = customStyle(htmlElement) ?? defaultStyle
        // Check whether the HTMLElement's attributes contain a `style` attribute
        guard let styleString = htmlElement?.attributes?["style"],
              styleAttributes.count > 0 else {
            // No
            return markupStyle
        }

        // Has Style Attributes
        // Split the Style Value string into an array
        // font-size:14px;color:red -> [["font-size", "14px"], ["color", "red"]]
        let styles = styleString.split(separator: ";").filter { $0.trimmingCharacters(in: .whitespacesAndNewlines) != "" }.map { $0.split(separator: ":") }
        
        for style in styles {
            guard style.count == 2 else {
                continue
            }
            // e.g font-size
            let key = style[0].trimmingCharacters(in: .whitespacesAndNewlines)
            // e.g. 14px
            let value = style[1].trimmingCharacters(in: .whitespacesAndNewlines)
            
            if let styleAttribute = styleAttributes.first(where: { $0.isEqualTo(styleName: key) }) {
                // Use the HTMLTagStyleAttributeToMarkupStyleVisitor mentioned above to convert back to MarkupStyle
                let visitor = HTMLTagStyleAttributeToMarkupStyleVisitor(value: value)
                if var thisMarkupStyle = visitor.visit(styleAttribute: styleAttribute) {
                    // When Style Attribute has a return value..
                    // Merge the result of the previous MarkupStyle
                    thisMarkupStyle.fillIfNil(from: markupStyle)
                    markupStyle = thisMarkupStyle
                }
            }
        }
        
        // If there is a default Style
        if var defaultStyle = defaultStyle {
            switch policy {
            case .respectMarkupStyleFromHTMLStyleAttribute:
                // Prioritize Style Attribute MarkupStyle, then
                // Merge the result of defaultStyle
                markupStyle?.fillIfNil(from: defaultStyle)
            case .respectMarkupStyleFromCode:
                // Prioritize defaultStyle, then
                // Merge the result of Style Attribute MarkupStyle
                defaultStyle.fillIfNil(from: markupStyle)
                markupStyle = defaultStyle
            }
        }
        
        return markupStyle
    }
}

Corresponding implementation in the source code HTMLTagAttributeToMarkupStyleVisitor.swift

We define some default styles in MarkupStyle. A Markup falls back to its default style when no custom style is specified for it from outside.

There are two style inheritance strategies:

  • respectMarkupStyleFromCode: The style from code is primary; the Style Attributes only fill in fields that are still empty, and fields that already have a value are left untouched.
  • respectMarkupStyleFromHTMLStyleAttribute: The Style Attributes are primary; the style from code only fills in fields that are still empty, and fields that already have a value are left untouched.
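
The difference between the two strategies comes down to which side is filled first. The sketch below uses hypothetical names (PolicyDemoStyle, DemoPolicy, merge) and a two-field style purely to show the merge order; it is a simplified model of the idea, not the visitor's actual code path.

```swift
// Hypothetical two-field style used only to illustrate the merge order.
struct PolicyDemoStyle: Equatable {
    var color: String? = nil
    var fontSize: Double? = nil

    mutating func fillIfNil(from: PolicyDemoStyle) {
        color = color ?? from.color
        fontSize = fontSize ?? from.fontSize
    }
}

enum DemoPolicy {
    case respectMarkupStyleFromCode
    case respectMarkupStyleFromHTMLStyleAttribute
}

// The primary side keeps its values; the secondary side only fills the gaps.
func merge(codeStyle: PolicyDemoStyle, htmlStyle: PolicyDemoStyle, policy: DemoPolicy) -> PolicyDemoStyle {
    var primary: PolicyDemoStyle
    let secondary: PolicyDemoStyle
    switch policy {
    case .respectMarkupStyleFromCode:
        primary = codeStyle
        secondary = htmlStyle
    case .respectMarkupStyleFromHTMLStyleAttribute:
        primary = htmlStyle
        secondary = codeStyle
    }
    primary.fillIfNil(from: secondary)
    return primary
}

// Code says: blue. HTML says: style="color:red;font-size:14px".
let demoCodeStyle = PolicyDemoStyle(color: "blue")
let demoHTMLStyle = PolicyDemoStyle(color: "red", fontSize: 14)

let fromCode = merge(codeStyle: demoCodeStyle, htmlStyle: demoHTMLStyle, policy: .respectMarkupStyleFromCode)
let fromHTML = merge(codeStyle: demoCodeStyle, htmlStyle: demoHTMLStyle, policy: .respectMarkupStyleFromHTMLStyleAttribute)
```

In both cases the font size ends up as 14 because only the HTML side defines it; only the conflicting field, color, depends on the policy.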

HTMLElementWithMarkupToMarkupStyleProcessor

Convert the Normalization result into AST & MarkupStyleComponent.

Declare a new MarkupComponent to store the corresponding MarkupStyle:

struct MarkupStyleComponent: MarkupComponent {
    typealias T = MarkupStyle
    
    let markup: Markup
    let value: MarkupStyle
    init(markup: Markup, value: MarkupStyle) {
        self.markup = markup
        self.value = value
    }
}

Simple traversal of the Markup Tree & HTMLElementMarkupComponent structure:

let styleAttributes: [HTMLTagStyleAttribute]
let policy: MarkupStylePolicy
    
func process(from: (Markup, [HTMLElementMarkupComponent])) -> [MarkupStyleComponent] {
  var components: [MarkupStyleComponent] = []
  let visitor = HTMLElementMarkupComponentMarkupStyleVisitor(policy: policy, components: from.1, styleAttributes: styleAttributes)
  walk(markup: from.0, visitor: visitor, components: &components)
  return components
}
    
func walk(markup: Markup, visitor: HTMLElementMarkupComponentMarkupStyleVisitor, components: inout [MarkupStyleComponent]) {
        
  if let markupStyle = visitor.visit(markup: markup) {
    components.append(.init(markup: markup, value: markupStyle))
  }
        
  for markup in markup.childMarkups {
    walk(markup: markup, visitor: visitor, components: &components)
  }
}

// print(components)
// [(markup: LinkMarkup, MarkupStyle(link: https://zhgchg.li, color: .blue)]
// [(markup: BoldMarkup, MarkupStyle(font: .init(weight: .bold))]

Corresponding implementation in the source code HTMLElementWithMarkupToMarkupStyleProcessor.swift

The process result is shown in the above image

Render — Convert To NSAttributedString

Now that we have the HTML Tag abstract tree structure and the MarkupStyle corresponding to the HTML Tag, the final step is to produce the final NSAttributedString rendering result.

MarkupNSAttributedStringVisitor

Visit each Markup and convert it to NSAttributedString.

struct MarkupNSAttributedStringVisitor: MarkupVisitor {
    typealias Result = NSAttributedString
    
    let components: [MarkupStyleComponent]
    // root / base MarkupStyle, specified externally, for example, the size of the entire string
    let rootStyle: MarkupStyle?
    
    func visit(_ markup: RootMarkup) -> Result {
        // Look down to the RawString object
        return collectAttributedString(markup)
    }
    
    func visit(_ markup: RawStringMarkup) -> Result {
        // Return Raw String
        // Collect all MarkupStyles in the chain
        // Apply Style to NSAttributedString
        return applyMarkupStyle(markup.attributedString, with: collectMarkupStyle(markup))
    }
    
    func visit(_ markup: BoldMarkup) -> Result {
        // Look down to the RawString object
        return collectAttributedString(markup)
    }
    
    func visit(_ markup: LinkMarkup) -> Result {
        // Look down to the RawString object
        return collectAttributedString(markup)
    }
    // ...
}

private extension MarkupNSAttributedStringVisitor {
    // Apply Style to NSAttributedString
    func applyMarkupStyle(_ attributedString: NSAttributedString, with markupStyle: MarkupStyle?) -> NSAttributedString {
        guard let markupStyle = markupStyle else { return attributedString }
        let mutableAttributedString = NSMutableAttributedString(attributedString: attributedString)
        mutableAttributedString.addAttributes(markupStyle.render(), range: NSMakeRange(0, mutableAttributedString.string.utf16.count))
        return mutableAttributedString
    }

    func collectAttributedString(_ markup: Markup) -> NSMutableAttributedString {
        // collect from downstream
        // Root -> Bold -> String("Bold")
        //      \
        //       > String("Test")
        // Result: Bold Test
        // Recursively visit and combine the final NSAttributedString by looking down layer by layer for raw strings
        return markup.childMarkups.compactMap({ visit(markup: $0) }).reduce(NSMutableAttributedString()) { partialResult, attributedString in
            partialResult.append(attributedString)
            return partialResult
        }
    }
    
    func collectMarkupStyle(_ markup: Markup) -> MarkupStyle? {
        // collect from upstream
        // String("Test") -> Bold -> Italic -> Root
        // Result: style: Bold+Italic
        // Inherit styles layer by layer by looking up for parent tag's markupstyle
        var currentMarkup: Markup? = markup.parentMarkup
        var currentStyle = components.value(markup: markup)
        while let thisMarkup = currentMarkup {
            guard let thisMarkupStyle = components.value(markup: thisMarkup) else {
                currentMarkup = thisMarkup.parentMarkup
                continue
            }

            if var thisCurrentStyle = currentStyle {
                thisCurrentStyle.fillIfNil(from: thisMarkupStyle)
                currentStyle = thisCurrentStyle
            } else {
                currentStyle = thisMarkupStyle
            }

            currentMarkup = thisMarkup.parentMarkup
        }
        
        if var currentStyle = currentStyle {
            currentStyle.fillIfNil(from: rootStyle)
            return currentStyle
        } else {
            return rootStyle
        }
    }
}

Corresponding implementation in the source code MarkupNSAttributedStringVisitor.swift

Operation process and result as shown in the figure

Finally, we can get:

Li{
    NSColor = "Blue";
    NSFont = "<UICTFont: 0x145d17600> font-family: \".SFUI-Regular\"; font-weight: normal; font-style: normal; font-size: 13.00pt";
    NSLink = "https://zhgchg.li";
}nk{
    NSColor = "Blue";
    NSFont = "<UICTFont: 0x145d18710> font-family: \".SFUI-Semibold\"; font-weight: bold; font-style: normal; font-size: 13.00pt";
    NSLink = "https://zhgchg.li";
}Bold{
    NSFont = "<UICTFont: 0x145d18710> font-family: \".SFUI-Semibold\"; font-weight: bold; font-style: normal; font-size: 13.00pt";
}

🎉🎉🎉🎉Completed🎉🎉🎉🎉

At this point, we have completed the entire conversion process from HTML String to NSAttributedString.

Stripper — Stripping HTML Tags

Stripping HTML tags is relatively simple; we only need to:

func attributedString(_ markup: Markup) -> NSAttributedString {
  if let rawStringMarkup = markup as? RawStringMarkup {
    return rawStringMarkup.attributedString
  } else {
    return markup.childMarkups.compactMap({ attributedString($0) }).reduce(NSMutableAttributedString()) { partialResult, attributedString in
      partialResult.append(attributedString)
      return partialResult
    }
  }
}

Corresponding implementation in the source code MarkupStripperProcessor.swift

Similar to Render, except that it simply returns the content once a RawStringMarkup is found, without applying any style.
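
The same recursion can be sketched on plain strings. StripNode is a hypothetical mini AST invented for this illustration (the library works on Markup objects and NSAttributedString instead); it only shows the depth-first walk that keeps raw text and drops the tags.

```swift
// Hypothetical mini AST: a node is either raw text or a tag wrapping children.
enum StripNode {
    case text(String)
    case tag(name: String, children: [StripNode])
}

// Depth-first walk that keeps only the raw text, in document order.
func strip(_ node: StripNode) -> String {
    switch node {
    case .text(let string):
        return string
    case .tag(_, let children):
        return children.map(strip).joined()
    }
}

// <b>Test<a>Link</a></b>
let stripSample = StripNode.tag(name: "b", children: [
    .text("Test"),
    .tag(name: "a", children: [.text("Link")])
])
let strippedText = strip(stripSample)
```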

Extend — Dynamic Extension

Since the built-in implementation cannot cover every HTML tag and Style Attribute, a dynamic extension point is provided so that new ones can be added directly from code.

public struct ExtendTagName: HTMLTagName {
    public let string: String
    
    public init(_ w3cHTMLTagName: WC3HTMLTagName) {
        self.string = w3cHTMLTagName.rawValue
    }
    
    public init(_ string: String) {
        self.string = string.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
    }
    
    public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
        return visitor.visit(self)
    }
}
// to
final class ExtendMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

//----

public struct ExtendHTMLTagStyleAttribute: HTMLTagStyleAttribute {
    public let styleName: String
    public let render: ((String) -> (MarkupStyle?)) // Dynamically change MarkupStyle using closure
    
    public init(styleName: String, render: @escaping ((String) -> (MarkupStyle?))) {
        self.styleName = styleName
        self.render = render
    }
    
    public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagStyleAttributeVisitor {
        return visitor.visit(self)
    }
}

ZHTMLParserBuilder

Finally, we use the Builder Pattern to allow external Modules to quickly construct the objects required by ZMarkupParser and ensure Access Level Control.

public final class ZHTMLParserBuilder {
    
    private(set) var htmlTags: [HTMLTag] = []
    private(set) var styleAttributes: [HTMLTagStyleAttribute] = []
    private(set) var rootStyle: MarkupStyle?
    private(set) var policy: MarkupStylePolicy = .respectMarkupStyleFromCode
    
    public init() {
        
    }
    
    public static func initWithDefault() -> Self {
        var builder = Self.init()
        for htmlTagName in ZHTMLParserBuilder.htmlTagNames {
            builder = builder.add(htmlTagName)
        }
        for styleAttribute in ZHTMLParserBuilder.styleAttributes {
            builder = builder.add(styleAttribute)
        }
        return builder
    }
    
    public func set(_ htmlTagName: HTMLTagName, withCustomStyle markupStyle: MarkupStyle?) -> Self {
        return self.add(htmlTagName, withCustomStyle: markupStyle)
    }
    
    public func add(_ htmlTagName: HTMLTagName, withCustomStyle markupStyle: MarkupStyle? = nil) -> Self {
        // Only one tagName can exist
        htmlTags.removeAll { htmlTag in
            return htmlTag.tagName.string == htmlTagName.string
        }
        
        htmlTags.append(HTMLTag(tagName: htmlTagName, customStyle: markupStyle))
        
        return self
    }
    
    public func add(_ styleAttribute: HTMLTagStyleAttribute) -> Self {
        styleAttributes.removeAll { thisStyleAttribute in
            return thisStyleAttribute.styleName == styleAttribute.styleName
        }
        
        styleAttributes.append(styleAttribute)
        
        return self
    }
    
    public func set(rootStyle: MarkupStyle) -> Self {
        self.rootStyle = rootStyle
        return self
    }
    
    public func set(policy: MarkupStylePolicy) -> Self {
        self.policy = policy
        return self
    }
    
    public func build() -> ZHTMLParser {
        // ZHTMLParser init is only open for internal use, external cannot directly init
        // Can only be initialized through ZHTMLParserBuilder
        return ZHTMLParser(htmlTags: htmlTags, styleAttributes: styleAttributes, policy: policy, rootStyle: rootStyle)
    }
}

Corresponding implementation in ZHTMLParserBuilder.swift

initWithDefault adds every implemented HTMLTagName and Style Attribute by default:

public extension ZHTMLParserBuilder {
    static var htmlTagNames: [HTMLTagName] {
        return [
            A_HTMLTagName(),
            B_HTMLTagName(),
            BR_HTMLTagName(),
            DIV_HTMLTagName(),
            HR_HTMLTagName(),
            I_HTMLTagName(),
            LI_HTMLTagName(),
            OL_HTMLTagName(),
            P_HTMLTagName(),
            SPAN_HTMLTagName(),
            STRONG_HTMLTagName(),
            U_HTMLTagName(),
            UL_HTMLTagName(),
            DEL_HTMLTagName(),
            TR_HTMLTagName(),
            TD_HTMLTagName(),
            TH_HTMLTagName(),
            TABLE_HTMLTagName(),
            IMG_HTMLTagName(handler: nil),
            // ...
        ]
    }
}

public extension ZHTMLParserBuilder {
    static var styleAttributes: [HTMLTagStyleAttribute] {
        return [
            ColorHTMLTagStyleAttribute(),
            BackgroundColorHTMLTagStyleAttribute(),
            FontSizeHTMLTagStyleAttribute(),
            FontWeightHTMLTagStyleAttribute(),
            LineHeightHTMLTagStyleAttribute(),
            WordSpacingHTMLTagStyleAttribute(),
            // ...
        ]
    }
}

The ZHTMLParser initializer is internal, so code outside the module cannot call it directly; a parser can only be created through ZHTMLParserBuilder.
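
The two properties of the builder described here (a restricted initializer, and "only one tagName can exist") can be sketched in miniature. MiniTag, MiniParser and MiniParserBuilder are hypothetical types made up for this illustration; because the sketch is a single file, fileprivate stands in for the internal access level that a real module would use.

```swift
// Hypothetical miniature of the builder; names and fields are illustrative only.
struct MiniTag: Equatable {
    let name: String
    let customColor: String?
}

final class MiniParser {
    let tags: [MiniTag]

    // fileprivate keeps code outside this file from bypassing the builder.
    fileprivate init(tags: [MiniTag]) {
        self.tags = tags
    }
}

final class MiniParserBuilder {
    private(set) var tags: [MiniTag] = []

    // Adding a tag name that already exists replaces the earlier entry.
    @discardableResult
    func add(_ name: String, customColor: String? = nil) -> MiniParserBuilder {
        tags.removeAll { $0.name == name }
        tags.append(MiniTag(name: name, customColor: customColor))
        return self
    }

    func build() -> MiniParser {
        return MiniParser(tags: tags)
    }
}

let miniParser = MiniParserBuilder()
    .add("a")
    .add("b")
    .add("a", customColor: "red") // replaces the first "a"
    .build()
```

Returning self from each method is what allows the chained call style, and the final build() is the only place where the parser object is created.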

ZHTMLParser encapsulates Render/Selector/Stripper operations:

public final class ZHTMLParser: ZMarkupParser {
    let htmlTags: [HTMLTag]
    let styleAttributes: [HTMLTagStyleAttribute]
    let rootStyle: MarkupStyle?

    internal init(...) {
    }
    
    // Get link style attributes
    public var linkTextAttributes: [NSAttributedString.Key: Any] {
        // ...
    }
    
    public func selector(_ string: String) -> HTMLSelector {
        // ...
    }
    
    public func selector(_ attributedString: NSAttributedString) -> HTMLSelector {
        // ...
    }
    
    public func render(_ string: String) -> NSAttributedString {
        // ...
    }
    
    // Allow rendering of NSAttributedString within nodes using HTMLSelector results
    public func render(_ selector: HTMLSelector) -> NSAttributedString {
        // ...
    }
    
    public func render(_ attributedString: NSAttributedString) -> NSAttributedString {
        // ...
    }
    
    public func stripper(_ string: String) -> String {
        // ...
    }
    
    public func stripper(_ attributedString: NSAttributedString) -> NSAttributedString {
        // ...
    }
    
  // ...
}

Corresponding implementation in the source code ZHTMLParser.swift

UIKit Issues

The result of NSAttributedString is most commonly displayed in a UITextView, but note:

  • The link style in UITextView is determined uniformly by the linkTextAttributes setting rather than by NSAttributedString.Key, and individual links cannot be styled separately; this is why ZMarkupParser exposes linkTextAttributes.
  • UILabel currently has no way to change the link style, and because UILabel has no TextStorage, loading NSTextAttachment images requires handling UILabel separately.

public extension UITextView {
    func setHtmlString(_ string: String, with parser: ZHTMLParser) {
        self.setHtmlString(NSAttributedString(string: string), with: parser)
    }
    
    func setHtmlString(_ string: NSAttributedString, with parser: ZHTMLParser) {
        self.attributedText = parser.render(string)
        self.linkTextAttributes = parser.linkTextAttributes
    }
}
public extension UILabel {
    func setHtmlString(_ string: String, with parser: ZHTMLParser) {
        self.setHtmlString(NSAttributedString(string: string), with: parser)
    }
    
    func setHtmlString(_ string: NSAttributedString, with parser: ZHTMLParser) {
        let attributedString = parser.render(string)
        attributedString.enumerateAttribute(NSAttributedString.Key.attachment, in: NSMakeRange(0, attributedString.string.utf16.count), options: []) { (value, _, _) in
            guard let attachment = value as? ZNSTextAttachment else {
                return
            }
            
            attachment.register(self)
        }
        
        self.attributedText = attributedString
    }
}

Therefore, by extending UIKit, external users only need to call setHtmlString() to complete the binding.

Complex Rendering Items— List Items

Some notes on implementing list items.

Using <ol> / <ul> to wrap <li> in HTML to represent list items:

<ul>
    <li>ItemA</li>
    <li>ItemB</li>
    <li>ItemC</li>
    <!-- ... -->
</ul>

With the same parsing approach as before, visit(_ markup: ListItemMarkup) can look at the sibling list items to find the current item's index (one benefit of converting to an AST).

func visit(_ markup: ListItemMarkup) -> Result {
  // Siblings of the same kind under the parent list; the index among them is the item's position
  let siblingListItems = markup.parentMarkup?.childMarkups.filter({ $0 is ListItemMarkup }) ?? []
  let position = (siblingListItems.firstIndex(where: { $0 === markup }) ?? 0)
  // ...
}
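
A self-contained sketch of the same lookup, turning the position into an ordered-list marker. ListNode and orderedMarker are hypothetical names for this illustration, and the "1. " format is just one possible marker; the library generates its symbols differently, as described below.

```swift
// Hypothetical stand-in for the Markup tree; only what is needed to find a sibling index.
final class ListNode {
    let kind: String            // e.g. "ol", "li", "text"
    weak var parent: ListNode?
    private(set) var children: [ListNode] = []

    init(kind: String) {
        self.kind = kind
    }

    func append(_ child: ListNode) {
        child.parent = self
        children.append(child)
    }
}

// 1-based marker such as "2. " for the second <li> inside its parent.
func orderedMarker(for item: ListNode) -> String {
    let siblings = item.parent?.children.filter({ $0.kind == "li" }) ?? []
    let position = siblings.firstIndex(where: { $0 === item }) ?? 0
    return "\(position + 1). "
}

let orderedList = ListNode(kind: "ol")
let itemA = ListNode(kind: "li")
let itemB = ListNode(kind: "li")
orderedList.append(itemA)
orderedList.append(ListNode(kind: "text")) // text between tags must not shift the index
orderedList.append(itemB)
```

Filtering the siblings by kind before searching is what keeps whitespace or other nodes between the list items from affecting the numbering.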

NSParagraphStyle has an NSTextList object that can be used to display list items, but in practice, it cannot customize the width of the whitespace (personally, I think the whitespace is too large). If there is whitespace between the bullet and the string, it will trigger a line break here, making the display look a bit odd, as shown in the image below:

The Better result can potentially be achieved by setting headIndent, firstLineHeadIndent, and NSTextTab, but testing shows that when the string is too long or the size changes, the result is still not rendered perfectly.

For now we settle for the Acceptable result: build the list marker string ourselves and insert it before the item's text.

We only use NSTextList.MarkerFormat to generate list item symbols, rather than directly using NSTextList.

For a list of supported list symbols, refer to: MarkupStyleList.swift

Final display result: <ol><li>

Complex Rendering Items — Table

Similar to the implementation of list items, but for tables.

Using <table> in HTML to create a table -> wrapping <tr> table rows -> wrapping <td>/<th> to represent table cells:

<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Alfreds Futterkiste</td>
    <td>Maria Anders</td>
    <td>Germany</td>
  </tr>
  <tr>
    <td>Centro comercial Moctezuma</td>
    <td>Francisco Chang</td>
    <td>Mexico</td>
  </tr>
</table>

Testing shows that the native NSAttributedString.DocumentType.html uses the Private macOS API NSTextBlock to complete the display, thus it can fully display the HTML table style and content.

A bit of cheating! We can’t use Private API 🥲

    func visit(_ markup: TableColumnMarkup) -> Result {
        let attributedString = collectAttributedString(markup)
        let siblingColumns = markup.parentMarkup?.childMarkups.filter({ $0 is TableColumnMarkup }) ?? []
        let position = (siblingColumns.firstIndex(where: { $0 === markup }) ?? 0)
        
        // Whether to specify the desired width externally, can set .max to not truncate string
        var maxLength: Int? = markup.fixedMaxLength
        if maxLength == nil {
            // If not specified, find the string length of the same column in the first row as the max length
            if let tableRowMarkup = markup.parentMarkup as? TableRowMarkup,
               let firstTableRow = tableRowMarkup.parentMarkup?.childMarkups.first(where: { $0 is TableRowMarkup }) as? TableRowMarkup {
                let firstTableRowColumns = firstTableRow.childMarkups.filter({ $0 is TableColumnMarkup })
                if firstTableRowColumns.indices.contains(position) {
                    let firstTableRowColumnAttributedString = collectAttributedString(firstTableRowColumns[position])
                    let length = firstTableRowColumnAttributedString.string.utf16.count
                    maxLength = length
                }
            }
        }
        
        if let maxLength = maxLength {
            // If the field exceeds maxLength, truncate the string
            if attributedString.string.utf16.count > maxLength {
                attributedString.mutableString.setString(String(attributedString.string.prefix(maxLength))+"...")
            } else {
                attributedString.mutableString.setString(attributedString.string.padding(toLength: maxLength, withPad: " ", startingAt: 0))
            }
        }
        
        if position < siblingColumns.count - 1 {
            // Add spaces as spacing, the width of the spacing can be specified externally
            attributedString.append(makeString(in: markup, string: String(repeating: " ", count: markup.spacing)))
        }
        
        return attributedString
    }
    
    func visit(_ markup: TableRowMarkup) -> Result {
        let attributedString = collectAttributedString(markup)
        attributedString.append(makeBreakLine(in: markup)) // Add line break, for details refer to Source Code
        return attributedString
    }
    
    func visit(_ markup: TableMarkup) -> Result {
        let attributedString = collectAttributedString(markup)
        attributedString.append(makeBreakLine(in: markup)) // Add line break, for details refer to Source Code
        attributedString.insert(makeBreakLine(in: markup), at: 0) // Add line break, for details refer to Source Code
        return attributedString
    }

The final presentation effect is as follows:

Not perfect, but acceptable.
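To make the truncate-or-pad rule concrete, here is a minimal standalone sketch on plain String values instead of Markup objects. fitColumn and the sample rows are hypothetical names used only for illustration and are not part of ZMarkupParser; as in the visitor above, the first row decides the width of each column.

```swift
import Foundation

// Hypothetical helper for illustration only (not ZMarkupParser API):
// truncates text longer than `width` and pads shorter text with spaces.
// Assumes simple text where one Character equals one UTF-16 unit.
func fitColumn(_ text: String, width: Int) -> String {
    if text.count > width {
        return String(text.prefix(width)) + "..."
    }
    return text.padding(toLength: width, withPad: " ", startingAt: 0)
}

let rows = [["Name", "City"], ["Alexander", "Rome"], ["Bo", "Oslo"]]
// The first row decides the width of every column.
let widths = rows[0].map { $0.count }
// Spacing between columns (configurable in the real visitor).
let spacing = String(repeating: " ", count: 2)

let lines = rows.map { row in
    zip(row, widths).map { fitColumn($0, width: $1) }.joined(separator: spacing)
}
print(lines.joined(separator: "\n"))
```

In this sketch a truncated cell ends up three characters wider than its column because of the appended "...", so columns only line up exactly when nothing is truncated.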

Complex Rendering Items — Image

Finally, let’s talk about the biggest challenge: loading remote images into NSAttributedString.

Use <img> to represent images in HTML:

<img src="https://user-images.githubusercontent.com/33706588/219608966-20e0c017-d05c-433a-9a52-091bc0cfd403.jpg" width="300" height="125"/>

You can specify the desired display size through the width / height HTML attributes.

Displaying images in an NSAttributedString is much more complicated than it looks, and there is no good ready-made implementation. I had already run into some pitfalls when implementing text wrapping in UITextView, and after researching the topic again I found that there is still no perfect solution.

For now, let’s set aside the issue that NSTextAttachment cannot natively be reused or release its memory. We will first download the image from the remote URL, place it into an NSTextAttachment, put that into the NSAttributedString, and have the content update automatically.

This series of operations is split into another small project for better optimization and reuse in other projects in the future:

The implementation mainly follows the Asynchronous NSTextAttachments series of articles, but replaces the final content-update step (refreshing the UI after the download) and adds a Delegate/DataSource for external extension.

Operation flow and relationship as shown in the figure above:

  • Declare a ZNSTextAttachmentable type that wraps either the NSTextStorage (built into UITextView) or the UILabel itself (UILabel has no NSTextStorage). Its only operation is replacing the attributed string at a given NSRange: func replace(attachment: ZNSTextAttachment, to: ZResizableNSTextAttachment)
  • ZNSTextAttachment packages the imageURL, the placeholder image, and the size to display; the placeholder image is shown first
  • When the system needs the image on screen, it calls the image(forBounds… method, and at that point we start downloading the image data
  • A DataSource lets the caller decide how to download the image or apply an image cache policy; by default, URLSession requests the image data directly
  • After the download finishes, create a new ZResizableNSTextAttachment and implement the custom image size logic in attachmentBounds(for…
  • Call replace(attachment: ZNSTextAttachment, to: ZResizableNSTextAttachment) to replace the ZNSTextAttachment at its position with the ZResizableNSTextAttachment
  • Send the didLoad Delegate notification so the caller can hook in if needed
  • Done

For detailed code, refer to Source Code.

NSLayoutManager.invalidateLayout(forCharacterRange: range, actualCharacterRange: nil) and NSLayoutManager.invalidateDisplay(forCharacterRange: range) are not used to refresh the UI because, with them, the UI did not display the update correctly. Since the range is already known, directly replacing that range of the NSAttributedString ensures the UI is updated correctly.
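The "replace by known range" idea can be illustrated without UIKit. The following is a Foundation-only sketch under stated assumptions: a tagged object-replacement character stands in for the ZNSTextAttachment placeholder, and placeholderKey and replacePlaceholder are hypothetical names for this example, not the library's API.

```swift
import Foundation

// Hypothetical attribute key, used only in this sketch to tag the placeholder.
let placeholderKey = NSAttributedString.Key("sketch.placeholderID")

/// Finds the range tagged with `id` and replaces it in place.
/// Returns false when no such placeholder exists (e.g. already replaced).
func replacePlaceholder(id: String,
                        in storage: NSMutableAttributedString,
                        with replacement: NSAttributedString) -> Bool {
    var found: NSRange?
    let fullRange = NSRange(location: 0, length: storage.length)
    storage.enumerateAttribute(placeholderKey, in: fullRange, options: []) { value, range, stop in
        if let value = value as? String, value == id {
            found = range
            stop.pointee = true
        }
    }
    guard let range = found else { return false }
    // Mutating the backing string directly is what triggers the refresh
    // when the storage belongs to a text view.
    storage.replaceCharacters(in: range, with: replacement)
    return true
}

// "\u{FFFC}" is the object replacement character that attachments occupy.
let text = NSMutableAttributedString(string: "Hello ")
text.append(NSAttributedString(string: "\u{FFFC}", attributes: [placeholderKey: "img-1"]))
text.append(NSAttributedString(string: " world"))

let replaced = replacePlaceholder(id: "img-1", in: text,
                                  with: NSAttributedString(string: "[image]"))
print(replaced, text.string)
```

In the real flow the replacement would be an attributed string containing the downloaded attachment rather than the literal text used here.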

The final display result is as follows:

<span style="color:red">こんにちは</span>こんにちはこんにちは <br />
<img src="https://user-images.githubusercontent.com/33706588/219608966-20e0c017-d05c-433a-9a52-091bc0cfd403.jpg"/>

Testing & Continuous Integration

In this project, in addition to writing Unit Tests, Snapshot Tests were also established for integration testing to facilitate comprehensive testing and comparison of the final NSAttributedString.

The main functional logic is covered by Unit Tests and integration tests. The final Test Coverage is around 85%.

[ZMarkupParser — codecov](https://app.codecov.io/gh/ZhgChgLi/ZMarkupParser){:target="_blank"}

Snapshot Test

Use the SnapshotTesting framework directly:

import SnapshotTesting
// ...
func testShouldKeepNSAttributedString() {
  let parser = ZHTMLParserBuilder.initWithDefault().build()
  let textView = UITextView()
  textView.frame.size.width = 390
  textView.isScrollEnabled = false
  textView.backgroundColor = .white
  textView.setHtmlString("html string...", with: parser)
  textView.layoutIfNeeded()
  assertSnapshot(matching: textView, as: .image, record: false)
}
// ...

This compares the final rendered result against the recorded snapshot to check that it meets expectations, ensuring that later changes do not break the integrated output.

Codecov Test Coverage

Integrate Codecov.io (free for public repos) to evaluate Test Coverage. Just install the Codecov GitHub App and configure it.

After connecting Codecov to the GitHub repo, you can also add a codecov.yml file to the root directory of the project:

comment:                  # this is a top-level key
  layout: "reach, diff, flags, files"
  behavior: default
  require_changes: false  # if true: only post the comment if coverage changes
  require_base: no        # [yes :: must have a base report to post]
  require_head: yes       # [yes :: must have a head report to post]

With this configuration file, Codecov automatically posts the coverage results as a comment on each PR after CI runs.

Continuous Integration

GitHub Actions CI integration: ci.yml

name: CI

on:
  workflow_dispatch:
  pull_request:
    types: [opened, reopened]
  push:
    branches:
    - main

jobs:
  build:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: spm build and test
        run: |
          set -o pipefail
          xcodebuild test -workspace ZMarkupParser.xcworkspace -testPlan ZMarkupParser -scheme ZMarkupParser -enableCodeCoverage YES -resultBundlePath './scripts/TestResult.xcresult' -destination 'platform=iOS Simulator,name=iPhone 14,OS=16.1' build test | xcpretty
      - name: Codecov
        uses: codecov/codecov-action@v3.1.1
        with:
          xcode: true
          xcode_archive_path: './scripts/TestResult.xcresult'

This configuration runs the build and tests when a PR is opened or reopened, or when commits are pushed to the main branch, and finally uploads the test coverage report to Codecov.

Regex

My regular expression skills improve a little with every use. Not much regex ended up in this project, but because I originally wanted to use a regex to extract paired HTML tags, I studied how to write one.

Some new cheat sheet notes from this round:

  • ?: lets ( ) group a sub-pattern without capturing it, e.g. (?:https?:\/\/)?(?:www\.)?example\.com matches https://www.example.com and returns only the entire URL, without separate captures for https:// and www
  • .+? is a non-greedy match (stops at the nearest possible end), e.g. <.+?> returns <a> and </a> from <a>test</a> instead of the entire string
  • (?=XYZ) is a lookahead: the match runs up to the point where the string XYZ appears, without consuming it. The similar-looking [^XYZ] is different: it matches any single character that is not X, Y, or Z. e.g. (?:__)(.+?(?=__))(?:__) (any string until __) captures test in a string such as __test__
  • (?R) recursively applies the whole pattern, e.g. \((?:[^()]|((?R)))+\) matches (simple), (and(nested)), (nested) in (simple) (and(nested))
  • (?<GroupName>…) names a group and \k<GroupName> matches the same text that group matched earlier, e.g. (?<tagName><a>).*(\k<tagName>)
  • (?(X)yes|no) is a conditional: if group X matched (a Group Name can also be used), match yes, otherwise match no. Swift does not support this yet
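A few of these rules can be checked quickly with NSRegularExpression. This is a minimal sketch; the helper allMatches and the sample inputs are made up for illustration, and only the rules NSRegularExpression is known to support (non-capturing groups, lazy quantifiers, lookahead, named backreferences) are exercised.

```swift
import Foundation

/// Hypothetical helper for this sketch: returns every full match of `pattern` in `text`.
func allMatches(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: range).compactMap { result in
        Range(result.range, in: text).map { String(text[$0]) }
    }
}

// Lazy quantifier: each match stops at the nearest ">".
let lazy = allMatches(of: #"<.+?>"#, in: "<a>test</a>")
// Greedy quantifier: one match that runs to the last ">".
let greedy = allMatches(of: #"<.+>"#, in: "<a>test</a>")

// Non-capturing groups: the pattern groups the optional parts but captures nothing.
let urlPattern = #"(?:https?://)?(?:www\.)?example\.com"#
let urls = allMatches(of: urlPattern, in: "visit https://www.example.com now")
let groupCount = (try? NSRegularExpression(pattern: urlPattern))?.numberOfCaptureGroups

// Lookahead: match up to, but not including, the "__" marker.
let untilMarker = allMatches(of: #".+?(?=__)"#, in: "test__rest")

// Named group + backreference: the closing tag must repeat the opening tag name.
let paired = allMatches(of: #"<(?<tag>[a-z]+)>.*?</\k<tag>>"#, in: "<b>bold</b><i>x</b>")

print(lazy, greedy, urls, untilMarker, paired)
```

The last pattern only handles flat, non-nested tags, which is part of why a regex alone is not enough to pair HTML tags in general.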

Other good articles on Regex:

Swift Package Manager & Cocoapods

This is also my first time developing a library with SPM & Cocoapods, and it has been quite interesting. SPM is really convenient, but if two projects depend on the same package, opening both projects at the same time causes one of them to fail to find the package and fail to build.

ZMarkupParser has been published to Cocoapods, but I haven’t tested whether it works properly there, because I’m using SPM 😝.

ChatGPT

In actual development, I found it most useful for helping edit the Readme; for the development work itself, I haven’t felt any significant impact yet. When asked mid-to-senior level questions, it couldn’t give a definite answer or even gave incorrect answers (I ran into some incorrect regex rules). So, in the end, I still turned to Google for the correct answers.

Asking it to write code is even less reliable unless it’s simple code generation for objects; don’t expect it to produce the entire tool architecture directly. (At least for now, it seems that Copilot might be more helpful for writing code.)

However, it can point out a general direction for knowledge blind spots, giving us a quick rough idea of how certain things should be done. When you know too little about a topic, it’s hard to pinpoint the right direction on Google quickly, and that’s when ChatGPT is quite helpful.

Disclaimer

After more than three months of research and development, I am exhausted, but I still need to state that this approach is only one feasible result of my research. It is not necessarily the best solution, and there may still be areas to optimize. This project is more of a starting point, in the hope of eventually reaching a perfect solution for Markup Language to NSAttributedString. Everyone is very welcome to contribute; many things still need the power of the community to be perfected.

Contributing

[ZMarkupParser](https://github.com/ZhgChgLi/ZMarkupParser){:target="_blank"} [⭐](https://github.com/ZhgChgLi/ZMarkupParser){:target="_blank"}

Here are some areas that I think can be improved as of now (2023/03/12), and will be recorded in the Repo later:

  1. Performance and algorithm optimization. Although it is faster and more stable than the native NSAttributedString.DocumentType.html, there is still much room for improvement. I believe the performance is definitely not as good as XMLParser; I hope one day it can match that performance while keeping customization and automatic error correction.
  2. Support for converting more HTML Tags and Style Attributes
  3. Further optimization of ZNSTextAttachment: implementing reuse and releasing memory; this may require researching CoreText
  4. Support for Markdown parsing. The underlying abstraction is not limited to HTML, so once a front end that converts Markdown to Markup objects is built, Markdown parsing can be completed. That is why I named it ZMarkupParser, not ZHTMLParser, hoping that one day it can also support Markdown to NSAttributedString
  5. Support for Any to Any conversion, e.g. HTML to Markdown or Markdown to HTML. Since we have the original AST (Markup objects), conversion between any Markup languages is possible
  6. Implement CSS !important, enhancing the inheritance strategy of the abstract MarkupStyle
  7. Enhance the HTML Selector; currently it offers only the most basic filtering
  8. Many more; feel free to open an issue

If you are willing but unable to contribute, you can also give me a ⭐ to make the Repo more visible, so that GitHub experts have the opportunity to help contribute!

Summary

[ZMarkupParser](https://github.com/ZhgChgLi/ZMarkupParser){:target="_blank"}

These are all the technical details and the journey of developing ZMarkupParser. It took almost three months of my after-work and weekend time, countless rounds of research and practice, writing tests, improving Test Coverage, and setting up CI before there was finally a somewhat decent result. I hope this tool solves the same problems for others and that everyone can help make it even better.

[pinkoi.com](https://www.pinkoi.com){:target="_blank"}

It is currently used in our company’s pinkoi.com iOS App, and no issues have been found. 😄

Further Reading

If you have any questions or feedback, feel free to contact me.

===

Chinese version of this article

===

This article was first published in Traditional Chinese on Medium ➡️ View Here


This post is licensed under CC BY 4.0 by the author.
