iOS Vision framework x WWDC 24 Discover Swift enhancements in the Vision framework Session

A review of the Vision framework and hands-on experiments with the new Swift API in iOS 18

Photo by [BoliviaInteligente](https://unsplash.com/@boliviainteligente?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash)

Topic

The Vision framework's relationship with Apple Vision Pro is like the relationship between hot dogs and dogs: completely unrelated.

Vision framework

The Vision framework is Apple’s integrated image recognition framework for machine learning, letting developers implement common image recognition features easily and quickly. It was introduced back in iOS 11.0 (2017, the iPhone 8 era) and has been continuously iterated and optimized ever since. Starting with iOS 18.0, it also provides a new Swift-first Vision API that integrates with Swift Concurrency to maximize performance.

Features of Vision framework

  • Ships with numerous built-in image recognition and motion tracking requests (31 as of iOS 18)
  • On-device computation using only the device’s chip; independent of cloud services, fast and secure
  • Simple, easy-to-use API
  • Supported on all Apple platforms: iOS 11.0+, iPadOS 11.0+, Mac Catalyst 13.0+, macOS 10.13+, tvOS 11.0+, visionOS 1.0+
  • Available for many years (2017-present) and continuously updated
  • Improves computational performance by integrating Swift language features

I played with this about six years ago: Exploring Vision - Automatically Recognizing Faces for App Avatar Cropping (Swift).

This time, following the WWDC 24 session “Discover Swift enhancements in the Vision framework”, I revisit the framework and combine it with the new Swift features.

CoreML

Apple also has another framework called Core ML, an on-device machine learning framework. It lets you train models for the objects or documents you want to recognize and use those models directly in your app (e.g. real-time article classification, real-time spam message detection). Interested friends can give it a try as well.
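For a rough sense of how a custom Core ML model plugs into the Vision pipeline, here is a minimal sketch using the pre-iOS 18 VNCoreMLRequest (the iOS 18 counterpart, CoreMLRequest, appears in the request table later in this article). The model class MyClassifier is a hypothetical placeholder for whatever model you have trained.

```swift
import Vision
import CoreML

// Minimal sketch: run a (hypothetical) Core ML classifier through Vision.
func classify(imageURL: URL) throws {
    // "MyClassifier" stands in for your own compiled Core ML model class.
    let mlModel = try MyClassifier(configuration: MLModelConfiguration()).model
    let visionModel = try VNCoreMLModel(for: mlModel)

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        guard let observations = request.results as? [VNClassificationObservation] else { return }
        observations.prefix(3).forEach { print("\($0.identifier): \($0.confidence)") }
    }

    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])
}
```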

p.s.

Vision v.s. VisionKit:

Vision: Mainly used for image analysis tasks such as face recognition, barcode detection, text recognition, etc. It provides powerful APIs to handle and analyze visual content in static images or videos.

VisionKit: Specifically designed for tasks related to document scanning. It offers a scanner view controller that can be used to scan documents and generate high-quality PDFs or images.
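As a point of comparison, below is a minimal sketch of VisionKit's document scanner, assuming a plain UIKit view controller; the cancel/failure delegate callbacks and error handling are omitted.

```swift
import UIKit
import VisionKit

// Minimal sketch: present VisionKit's document scanner and collect the scanned pages.
final class ScannerViewController: UIViewController, VNDocumentCameraViewControllerDelegate {
    func startScanning() {
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        present(scanner, animated: true)
    }

    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFinishWith scan: VNDocumentCameraScan) {
        // Each scanned page comes back as a UIImage; these could then be fed into Vision (e.g. OCR).
        let pages = (0..<scan.pageCount).map { scan.imageOfPage(at: $0) }
        print("Scanned \(pages.count) page(s)")
        controller.dismiss(animated: true)
    }
}
```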

On my M1 Mac, the Vision framework cannot run in the simulator and can only be tested on a physical device; running in the simulator throws a “Could not create Espresso context” error, and I found no solution in the official forum discussions.

Since I don’t have a physical device running iOS 18 for testing, all execution results in this article are based on the old (pre-iOS 18) syntax; please leave a comment if you find errors in the new syntax.

WWDC 2024 — Discover Swift enhancements in the Vision framework


This article is a sharing note for WWDC 24 — Discover Swift enhancements in the Vision framework session, along with some experimental insights.

Introduction — Vision framework Features

Face recognition, contour recognition

Text recognition in image content

As of iOS 18, it supports 18 languages.

// Supported language list
if #available(iOS 18.0, *) {
  print(RecognizeTextRequest().supportedRecognitionLanguages.map { "\($0.languageCode!)-\(($0.region?.identifier ?? $0.script?.identifier)!)" })
} else {
  print(try! VNRecognizeTextRequest().supportedRecognitionLanguages())
}

// The actual available recognition languages are based on this.
// Tested on iOS 18, the output is as follows:
// ["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant", "yue-Hans", "yue-Hant", "ko-KR", "ja-JP", "ru-RU", "uk-UA", "th-TH", "vi-VT", "ar-SA", "ars-SA"]
// Swedish language mentioned in WWDC was not seen, unsure if it has not been released yet or is related to device region and language settings

Dynamic motion capture

  • Can achieve dynamic capture of people and objects
  • Gesture capture implements air signature function

What’s new in Vision? (iOS 18) — Image rating feature (quality, key points)

  • Calculate scores for input images to easily filter out high-quality photos
  • The scoring considers multiple dimensions: not just image quality, but also lighting, angle, subject, whether the shot has memorable points, and so on

WWDC used the three images above (all of the same technical quality) to explain the scoring:

  • High-scoring image: composition, lighting, memorable points
  • Low-scoring image: no main subject; looks like it was taken casually or accidentally
  • Utility image: technically well-taken but lacks memorable points, like images used for stock photo libraries

iOS ≥ 18 New API: CalculateImageAestheticsScoresRequest

let request = CalculateImageAestheticsScoresRequest()
let result = try await request.perform(on: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*yL3vI1ADzwlovctW5WQgJw.jpeg")!)

// Photo score
print(result.overallScore)

// Whether it is judged as a utility image
print(result.isUtility)

What’s new in Vision? (iOS 18) — Simultaneous detection of body and gesture poses

In the past, only body pose and hand pose could be detected separately.

With this update, developers can detect both body and hand poses simultaneously, combining them into a single request and result, making it more convenient for further feature development.

iOS ≥ 18 New API: DetectHumanBodyPoseRequest

var request = DetectHumanBodyPoseRequest()
// Detect hand pose together
request.detectsHands = true

guard let bodyPose = try await request.perform(on: image).first else { return }

// Body Pose Joints
let bodyJoints = bodyPose.allJoints()
// Left hand Pose Joints
let leftHandJoints = bodyPose.leftHand.allJoints()
// Right hand Pose Joints
let rightHandJoints = bodyPose.rightHand.allJoints()

New Vision API

In this update, Apple provides a new Swift-native Vision API. Beyond covering the existing functionality, it focuses on Swift 6 / Swift Concurrency support, offering more efficient and more Swift-like ways to work with the framework.

Get started with Vision

The speaker here reintroduced the basic usage of the Vision framework. Apple has encapsulated 31 types of common image recognition requests and their corresponding “Observation” objects (as of iOS 18).

  1. Request: DetectFaceRectanglesRequest (face region detection) → Result: FaceObservation. The previous article “Exploring Vision - Automatically Identify Faces for Avatar Upload in Apps (Swift)” used this request/observation pair.

  2. Request: RecognizeTextRequest (text recognition) → Result: RecognizedTextObservation

  3. Request: GenerateObjectnessBasedSaliencyImageRequest (objectness-based saliency) → Result: SaliencyImageObservation

All 31 types of requests:

| Request | Purpose | Observation | Description |
| --- | --- | --- | --- |
| CalculateImageAestheticsScoresRequest | Calculate the aesthetic score of the image. | AestheticsObservation | Returns the aesthetic score of the image, considering factors like composition and color. |
| ClassifyImageRequest | Classify the content of the image. | ClassificationObservation | Returns the classification labels and confidence of objects or scenes in the image. |
| CoreMLRequest | Analyze images using Core ML models. | CoreMLFeatureValueObservation | Generates observations based on the output of Core ML models. |
| DetectAnimalBodyPoseRequest | Detect animal poses in images. | RecognizedPointsObservation | Returns the skeleton points of animals and their positions. |
| DetectBarcodesRequest | Detect barcodes in images. | BarcodeObservation | Returns barcode data and types (e.g., QR code). |
| DetectContoursRequest | Detect contours in images. | ContoursObservation | Returns detected contour lines in the image. |
| DetectDocumentSegmentationRequest | Detect and segment documents in images. | RectangleObservation | Returns the rectangular boundary positions of documents. |
| DetectFaceCaptureQualityRequest | Evaluate the quality of face captures. | FaceObservation | Returns quality assessment scores for facial images. |
| DetectFaceLandmarksRequest | Detect facial landmarks. | FaceObservation | Returns detailed positions of facial landmarks (e.g., eyes, nose). |
| DetectFaceRectanglesRequest | Detect faces in images. | FaceObservation | Returns the bounding box positions of faces. |
| DetectHorizonRequest | Detect horizons in images. | HorizonObservation | Returns the angle and position of the horizon. |
| DetectHumanBodyPose3DRequest | Detect 3D human body poses in images. | RecognizedPointsObservation | Returns 3D human skeleton points and their spatial coordinates. |
| DetectHumanBodyPoseRequest | Detect human body poses in images. | RecognizedPointsObservation | Returns human skeleton points and their coordinates. |
| DetectHumanHandPoseRequest | Detect hand poses in images. | RecognizedPointsObservation | Returns hand skeleton points and their positions. |
| DetectHumanRectanglesRequest | Detect humans in images. | HumanObservation | Returns the bounding box positions of humans. |
| DetectRectanglesRequest | Detect rectangles in images. | RectangleObservation | Returns the coordinates of the four vertices of rectangles. |
| DetectTextRectanglesRequest | Detect text regions in images. | TextObservation | Returns the positions and bounding boxes of text regions. |
| DetectTrajectoriesRequest | Detect and analyze object motion trajectories. | TrajectoryObservation | Returns motion trajectory points and their time series. |
| GenerateAttentionBasedSaliencyImageRequest | Generate attention-based saliency images. | SaliencyImageObservation | Returns saliency maps of the most attention-grabbing areas in the image. |
| GenerateForegroundInstanceMaskRequest | Generate foreground instance mask images. | InstanceMaskObservation | Returns masks of foreground objects. |
| GenerateImageFeaturePrintRequest | Generate image feature prints for comparison. | FeaturePrintObservation | Returns feature fingerprint data of images for similarity comparison. |
| GenerateObjectnessBasedSaliencyImageRequest | Generate objectness-based saliency images. | SaliencyImageObservation | Returns saliency maps of salient object areas. |
| GeneratePersonInstanceMaskRequest | Generate person instance mask images. | InstanceMaskObservation | Returns masks of person instances. |
| GeneratePersonSegmentationRequest | Generate person segmentation images. | SegmentationObservation | Returns binary images of person segmentation. |
| RecognizeAnimalsRequest | Detect and identify animals in images. | RecognizedObjectObservation | Returns animal types and their confidence levels. |
| RecognizeTextRequest | Detect and recognize text in images. | RecognizedTextObservation | Returns detected text content and its spatial positions. |
| TrackHomographicImageRegistrationRequest | Track homographic image registration. | ImageAlignmentObservation | Returns homographic transformation matrices between images for image registration. |
| TrackObjectRequest | Track objects in images. | DetectedObjectObservation | Returns the positions and velocity information of objects in images. |
| TrackOpticalFlowRequest | Track optical flow in images. | OpticalFlowObservation | Returns optical flow vector fields describing pixel movements. |
| TrackRectangleRequest | Track rectangles in images. | RectangleObservation | Returns the positions, sizes, and rotation angles of rectangles in images. |
| TrackTranslationalImageRegistrationRequest | Track translational image registration. | ImageAlignmentObservation | Returns translational transformation matrices between images for image registration. |
  • Prefix the names with VN for the old (pre-iOS 18) API, e.g. VNDetectBarcodesRequest.

The speaker mentioned several commonly used Requests as follows.

ClassifyImageRequest

Recognize the input image, obtain label classification and confidence.

[Travelogue] 2024 Second Visit to Kyushu 9-Day Free and Easy Trip, Entering Fukuoka by Busan→Hakata Cruise

if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = ClassifyImageRequest()
    Task {
        do {
            let observations = try await request.perform(on: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*yL3vI1ADzwlovctW5WQgJw.jpeg")!)
            observations.forEach {
                observation in
                print("\(observation.identifier): \(observation.confidence)")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old method
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNClassificationObservation] else {
            return
        }
        observations.forEach {
            observation in
            print("\(observation.identifier): \(observation.confidence)")
        }
    }

    let request = VNClassifyImageRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*3_jdrLurFuUfNdW4BJaRww.jpeg")!, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

  outdoor: 0.75392926
  sky: 0.75392926
  blue_sky: 0.7519531
  machine: 0.6958008
  cloudy: 0.26538086
  structure: 0.15728651
  sign: 0.14224191
  fence: 0.118652344
  banner: 0.0793457
  material: 0.075975396
  plant: 0.054406323
  foliage: 0.05029297
  light: 0.048126098
  lamppost: 0.048095703
  billboards: 0.040039062
  art: 0.03977703
  branch: 0.03930664
  decoration: 0.036868922
  flag: 0.036865234
....etc

RecognizeTextRequest

Recognize the text content in an image (a.k.a. OCR).

[Travelogue] 2023 Tokyo 5-day free trip

if #available(iOS 18.0, *) {
    // New API using Swift features
    var request = RecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = [.init(identifier: "ja-JP"), .init(identifier: "en-US")] // Specify the recognition language codes
    Task {
        do {
            let observations = try await request.perform(on: URL(string: "https://zhgchg.li/assets/9da2c51fa4f2/1*fBbNbDepYioQ-3-0XUkF6Q.jpeg")!)
            observations.forEach {
                observation in
                let topCandidate = observation.topCandidates(1).first
                print(topCandidate?.string ?? "No text recognized")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old way
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedTextObservation] else {
            return
        }
        observations.forEach {
            observation in
            let topCandidate = observation.topCandidates(1).first
            print(topCandidate?.string ?? "No text recognized")
        }
    }

    let request = VNRecognizeTextRequest(completionHandler: completionHandler)
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["ja-JP", "en-US"] // Specify the recognition language codes
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: URL(string: "https://zhgchg.li/assets/9da2c51fa4f2/1*fBbNbDepYioQ-3-0XUkF6Q.jpeg")!, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Result:

LE LABO Aoyama Store
TEL:03-6419-7167
*Thank you for your purchase*
No: 21347
Date: 2023/06/10 14.14.57
Responsible:
1690370
Register: 008A 1
Product Name
Tax-inclusive Price Quantity Tax-inclusive Total
Kaiak 10 EDP FB 15ML
J1P7010000S
16,800
16,800
Another 13 EDP FB 15ML
J1PJ010000S
10,700
10,700
Lip Balm 15ML
JOWC010000S
2,000
1
Total Amount
(Tax Included)
CARD
2,000
3 items purchased
29,500
0
29,500
29,500

DetectBarcodesRequest

Detect barcode and QR code data in the image.

Thai locals recommend Goose Brand Cooling Gel

let filePath = Bundle.main.path(forResource: "IMG_6777", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = DetectBarcodesRequest()
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            observations.forEach {
                observation in
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old way
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        observations.forEach {
            observation in
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let request = VNDetectBarcodesRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

Payload: 8859126000911
Symbology: VNBarcodeSymbologyEAN13
Payload: https://lin.ee/hGynbVM
Symbology: VNBarcodeSymbologyQR
Payload: http://www.hongthaipanich.com/
Symbology: VNBarcodeSymbologyQR
Payload: https://www.facebook.com/qr?id=100063856061714
Symbology: VNBarcodeSymbologyQR

RecognizeAnimalsRequest

Recognize animals in the image with confidence.

[meme Source](https://www.redbubble.com/i/canvas-print/Funny-AI-Woman-yelling-at-a-cat-meme-design-Machine-learning-by-omolog/43039298.5Y5V7)

let filePath = Bundle.main.path(forResource: "IMG_5026", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = RecognizeAnimalsRequest()
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            observations.forEach {
                observation in
                let labels = observation.labels
                labels.forEach {
                    label in
                    print("Detected animal: \(label.identifier) with confidence: \(label.confidence)")
                }
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old way
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedObjectObservation] else {
            return
        }
        observations.forEach {
            observation in
            let labels = observation.labels
            labels.forEach {
                label in
                print("Detected animal: \(label.identifier) with confidence: \(label.confidence)")
            }
        }
    }

    let request = VNRecognizeAnimalsRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

Detected animal: Cat with confidence: 0.77245045

Others:

  • Detecting human bodies in images: DetectHumanRectanglesRequest (a minimal sketch follows this list)
  • Detecting poses of animals and humans (3D or 2D): DetectAnimalBodyPoseRequest, DetectHumanBodyPose3DRequest, DetectHumanBodyPoseRequest, DetectHumanHandPoseRequest
  • Detecting and tracking object motion trajectories (across video or animation frames): DetectTrajectoriesRequest, TrackObjectRequest, TrackRectangleRequest
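These follow the same pattern as the examples above. As an illustration, here is a minimal sketch of DetectHumanRectanglesRequest using the iOS 18 syntax (untested, like the rest of the new-syntax code in this article), reusing the local fileURL from the previous examples:

```swift
if #available(iOS 18.0, *) {
    let request = DetectHumanRectanglesRequest()
    Task {
        do {
            // One HumanObservation per detected person
            let observations = try await request.perform(on: fileURL)
            observations.forEach { observation in
                print("Human boundingBox: \(observation.boundingBox)")
            }
        } catch {
            print("Request failed: \(error)")
        }
    }
}
```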

iOS ≥ 18 Update Highlight:

VN*Request -> *Request (e.g. VNDetectBarcodesRequest -> DetectBarcodesRequest)
VN*Observation -> *Observation (e.g. VNRecognizedObjectObservation -> RecognizedObjectObservation)
VNRequestCompletionHandler -> async/await
VNImageRequestHandler.perform([VN*Request]) -> *Request.perform()

WWDC Example

The official WWDC video uses a supermarket product scanner as an example.

Most products have a Barcode that can be scanned

We can obtain the barcode’s location from observation.boundingBox; note that, unlike the usual UIView coordinate system, the bounding box is normalized (values from 0 to 1) with its origin at the lower-left corner.

let filePath = Bundle.main.path(forResource: "IMG_6785", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    var request = DetectBarcodesRequest()
    request.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            if let observation = observations.first {
                DispatchQueue.main.async {
                    self.infoLabel.text = observation.payloadString
                    // Color layer marking
                    let colorLayer = CALayer()
                    // iOS >=18 new coordinate transformation API toImageCoordinates
                    // Not tested, may need to calculate the offset for ContentMode = AspectFit:
                    colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                    colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                    self.baseImageView.layer.addSublayer(colorLayer)
                }
                print("BoundingBox: \(observation.boundingBox.cgRect)")
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old approach
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        if let observation = observations.first {
            DispatchQueue.main.async {
                self.infoLabel.text = observation.payloadStringValue
                // Color layer marking
                let colorLayer = CALayer()
                colorLayer.frame = self.convertBoundingBox(observation.boundingBox, to: self.baseImageView)
                colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                self.baseImageView.layer.addSublayer(colorLayer)
            }
            print("BoundingBox: \(observation.boundingBox)")
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let request = VNDetectBarcodesRequest(completionHandler: completionHandler)
    request.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

iOS ≥ 18 Update Highlight:

// iOS ≥18 New Coordinate Transformation API toImageCoordinates
observation.boundingBox.toImageCoordinates(CGSize, origin: .upperLeft)
// https://developer.apple.com/documentation/vision/normalizedpoint/toimagecoordinates(from:imagesize:origin:)

Helper:

// Generated by ChatGPT 4o
// Since the photo in the ImageView is set with ContentMode = AspectFit
// Extra calculation is needed for the top and bottom offset caused by Fit
func convertBoundingBox(_ boundingBox: CGRect, to view: UIImageView) -> CGRect {
    guard let image = view.image else {
        return .zero
    }

    let imageSize = image.size
    let viewSize = view.bounds.size
    let imageRatio = imageSize.width / imageSize.height
    let viewRatio = viewSize.width / viewSize.height
    var scaleFactor: CGFloat
    var offsetX: CGFloat = 0
    var offsetY: CGFloat = 0
    if imageRatio > viewRatio {
        // Image fits in the width direction
        scaleFactor = viewSize.width / imageSize.width
        offsetY = (viewSize.height - imageSize.height * scaleFactor) / 2
    } else {
        // Image fits in the height direction
        scaleFactor = viewSize.height / imageSize.height
        offsetX = (viewSize.width - imageSize.width * scaleFactor) / 2
    }

    let x = boundingBox.minX * imageSize.width * scaleFactor + offsetX
    let y = (1 - boundingBox.maxY) * imageSize.height * scaleFactor + offsetY
    let width = boundingBox.width * imageSize.width * scaleFactor
    let height = boundingBox.height * imageSize.height * scaleFactor
    return CGRect(x: x, y: y, width: width, height: height)
}

Output:

BoundingBox: (0.5295758928571429, 0.21408638121589782, 0.0943080357142857, 0.21254415360708087)
Payload: 4710018183805
Symbology: VNBarcodeSymbologyEAN13

Some products do not have a barcode, such as loose fruits with only product labels

Therefore, our scanner also needs to support scanning pure text labels simultaneously.

let filePath = Bundle.main.path(forResource: "apple", ofType: "jpg")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    var barcodesRequest = DetectBarcodesRequest()
    barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
    var textRequest = RecognizeTextRequest()
    textRequest.recognitionLanguages = [.init(identifier: "zh-Hant"), .init(identifier: "en-US")]
    Task {
        do {
            let handler = ImageRequestHandler(fileURL)
            // perform() uses parameter pack syntax; all requests must finish before any of their results can be used, e.g.:
            // let (barcodesObservation, textObservation, ...) = try await handler.perform(barcodesRequest, textRequest, ...)
            let (barcodesObservation, textObservation) = try await handler.perform(barcodesRequest, textRequest)
            if let observation = barcodesObservation.first {
                DispatchQueue.main.async {
                    self.infoLabel.text = observation.payloadString
                    // Color layer
                    let colorLayer = CALayer()
                    // New Coordinate Transformation API toImageCoordinates for iOS >=18
                    // Not tested, may need to consider the offset of ContentMode = AspectFit:
                    colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                    colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                    self.baseImageView.layer.addSublayer(colorLayer)
                }
                print("BoundingBox: \(observation.boundingBox.cgRect)")
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
            textObservation.forEach {
                observation in
                let topCandidate = observation.topCandidates(1).first
                print(topCandidate?.string ?? "No text recognized")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old approach
    let barcodesCompletionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        if let observation = observations.first {
            DispatchQueue.main.async {
                self.infoLabel.text = observation.payloadStringValue
                // Color layer
                let colorLayer = CALayer()
                colorLayer.frame = self.convertBoundingBox(observation.boundingBox, to: self.baseImageView)
                colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                self.baseImageView.layer.addSublayer(colorLayer)
            }
            print("BoundingBox: \(observation.boundingBox)")
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let textCompletionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedTextObservation] else {
            return
        }
        observations.forEach {
            observation in
            let topCandidate = observation.topCandidates(1).first
            print(topCandidate?.string ?? "No text recognized")
        }
    }

    let barcodesRequest = VNDetectBarcodesRequest(completionHandler: barcodesCompletionHandler)
    barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
    let textRequest = VNRecognizeTextRequest(completionHandler: textCompletionHandler)
    textRequest.recognitionLevel = .accurate
    textRequest.recognitionLanguages = ["en-US"]
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([barcodesRequest, textRequest])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Output:

94128s
ORGANIC
Pink Lady®
Produce of USh

iOS ≥ 18 Update Highlight:

let handler = ImageRequestHandler(fileURL)
// perform() uses parameter pack syntax; all requests must finish before any of their results can be used, e.g.:
// let (barcodesObservation, textObservation, ...) = try await handler.perform(barcodesRequest, textRequest, ...)
let (barcodesObservation, textObservation) = try await handler.perform(barcodesRequest, textRequest)

iOS ≥ 18 performAll() method

The perform(barcodesRequest, textRequest) call above waits for both requests (barcode scanning and text recognition) to finish before execution continues. Starting from iOS 18, a new performAll() method returns results as a stream, so each request can be handled as soon as it completes, for example responding immediately once a barcode is scanned.

if #available(iOS 18.0, *) {
    // New API using Swift features
    var barcodesRequest = DetectBarcodesRequest()
    barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcodes is needed, it can be specified directly to improve performance
    var textRequest = RecognizeTextRequest()
    textRequest.recognitionLanguages = [.init(identifier: "zh-Hant"), .init(identifier: "en-US")]
    Task {
        let handler = ImageRequestHandler(fileURL)
        let observation = handler.performAll([barcodesRequest, textRequest] as [any VisionRequest])
        for try await result in observation {
            switch result {
                case .detectBarcodes(_, let barcodesObservation):
                if let observation = barcodesObservation.first {
                    DispatchQueue.main.async {
                        self.infoLabel.text = observation.payloadString
                        // Color layer marking
                        let colorLayer = CALayer()
                        // iOS >=18 new coordinate transformation API toImageCoordinates
                        // Not tested, may still need to calculate the offset for ContentMode = AspectFit:
                        colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                        colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                        self.baseImageView.layer.addSublayer(colorLayer)
                    }
                    print("BoundingBox: \(observation.boundingBox.cgRect)")
                    print("Payload: \(observation.payloadString ?? "No payload")")
                    print("Symbology: \(observation.symbology)")
                }
                case .recognizeText(_, let textObservation):
                textObservation.forEach {
                    observation in
                    let topCandidate = observation.topCandidates(1).first
                    print(topCandidate?.string ?? "No text recognized")
                }
                default:
                print("Unrecognized result: \(result)")
            }
        }
    }
}

Optimize with Swift Concurrency

Suppose we have an image-wall list where each image needs its main object automatically cropped out; this is where we can leverage Swift Concurrency to improve loading efficiency.

Original Implementation

func generateThumbnail(url: URL) async throws -> UIImage {
  let request = GenerateAttentionBasedSaliencyImageRequest()
  let saliencyObservation = try await request.perform(on: url)
  return cropImage(url, to: saliencyObservation.salientObjects)
}
    
func generateAllThumbnails() async throws {
  for image in images {
    image.thumbnail = try await generateThumbnail(url: image.url)
  }
}

Thumbnails are generated one at a time, which is slow and inefficient.

Optimization (1) — TaskGroup Concurrency

func generateAllThumbnails() async throws {
  try await withThrowingDiscardingTaskGroup { taskGroup in
    for image in images {
      taskGroup.addTask {
        image.thumbnail = try await generateThumbnail(url: image.url)
      }
    }
  }
}

Each task is added to the TaskGroup for concurrent execution.

Issue: image recognition and cropping are memory-intensive operations. Unbounded parallelism may cause UI lag and OOM crashes.

Optimization (2) — TaskGroup Concurrency + Limiting Parallelism

func generateAllThumbnails() async throws {
    try await withThrowingDiscardingTaskGroup {
        taskGroup in
        // Run at most 5 tasks concurrently
        let maxImageTasks = min(5, images.count)
        // Fill in 5 tasks first
        for index in 0..<maxImageTasks {
            taskGroup.addTask {
                images[index].thumbnail = try await generateThumbnail(url: images[index].url)
            }
        }
        var nextIndex = maxImageTasks
        for try await _ in taskGroup {
            // Each time a task in the taskGroup completes...
            // check whether the index has reached the end of the list
            if nextIndex < images.count {
                let image = images[nextIndex]
                // Continue filling tasks one by one (maintaining at most 5)
                taskGroup.addTask {
                    image.thumbnail = try await generateThumbnail(url: image.url)
                }
                nextIndex += 1
            }
        }
    }
}

Update an existing Vision app

  1. Vision will remove CPU and GPU support for some requests on devices that have a Neural Engine; on those devices, the Neural Engine is the best choice for performance. You can check what is supported with the supportedComputeDevices() API (a minimal sketch follows this list).
  2. Remove the VN prefixes: VNXXXRequest, VNXXXObservation -> XXXRequest, XXXObservation.
  3. Replace the original VNRequestCompletionHandler with async/await.
  4. Call *Request.perform() directly instead of VNImageRequestHandler.perform([VN*Request]).
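Below is a minimal sketch of checking the supported compute devices. It uses the pre-iOS 18 supportedComputeStageDevices() API on VNRequest (iOS 17+), since that is what I could verify; the supportedComputeDevices() call mentioned in the session is the new-API counterpart.

```swift
import Vision

// Minimal sketch (pre-iOS 18 API, iOS 17+): inspect which compute devices each
// processing stage of a request can run on before relying on CPU/GPU fallbacks.
let request = VNClassifyImageRequest()
do {
    let devicesByStage = try request.supportedComputeStageDevices()
    for (stage, devices) in devicesByStage {
        // devices contains MLComputeDevice values such as CPU, GPU, or the Neural Engine
        print("Stage \(stage): \(devices)")
    }
} catch {
    print("Query failed: \(error)")
}
```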

Wrap-up

  • API designed for Swift language features
  • New features and methods are Swift Only, available for iOS ≥ 18
  • New image scoring feature, body + hand movement tracking

Thanks!

KKday Business Recruitment

👉👉👉 This sharing is derived from the weekly technical sharing sessions within the KKday App Team. The team is currently recruiting a Senior iOS Engineer; interested friends are welcome to submit a resume. 👈👈👈

Reference

Discover Swift enhancements in the Vision framework

The Vision Framework API has been redesigned to leverage modern Swift features like concurrency, making it easier and faster to integrate a wide array of Vision algorithms into your app. We’ll tour the updated API and share sample code, along with best practices, to help you get the benefits of this framework with less coding effort. We’ll also demonstrate two new features: image aesthetics and holistic body pose.


Vision framework Apple Developer Documentation

-

Feel free to contact me for any questions or feedback.

===

Chinese version of this article

===

This article was first published in Traditional Chinese on Medium ➡️ View Here



This post is licensed under CC BY 4.0 by the author.
