iOS Vision framework x WWDC 24 Discover Swift enhancements in the Vision framework Session

A review of the Vision framework and hands-on experiments with the new Swift API in iOS 18

Photo by [BoliviaInteligente](https://unsplash.com/@boliviainteligente?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash)

Topic

The Vision framework's relationship with Apple Vision Pro is like the relationship between hot dogs and dogs: completely unrelated.

Vision framework

The Vision framework is Apple’s integrated image recognition framework for machine learning, letting developers implement common image recognition features easily and quickly. It was introduced back in iOS 11.0 (2017, the iPhone 8 era) and has been continuously iterated and optimized ever since. Starting with iOS 18.0, it also provides a new Swift-first Vision API that integrates with Swift Concurrency to maximize performance.

Features of Vision framework

  • Ships with numerous built-in image recognition and motion tracking requests (31 as of iOS 18)
  • On-device computation using only the device’s chip; independent of cloud services, fast and secure
  • Simple, easy-to-use API
  • Supported on all Apple platforms: iOS 11.0+, iPadOS 11.0+, Mac Catalyst 13.0+, macOS 10.13+, tvOS 11.0+, visionOS 1.0+
  • Available for many years (2017-present) and continuously updated
  • Improves computational performance by integrating Swift language features

I played with this about six years ago: Exploring Vision - Automatically Recognizing Faces for App Avatar Cropping (Swift).

This time, following the WWDC 24 session “Discover Swift enhancements in the Vision framework”, I revisit the framework and combine it with the new Swift features.

CoreML

Apple also has another framework called Core ML, an on-device machine learning framework. It lets you train models for the objects or documents you want to recognize and use those models directly in your app (e.g. real-time article classification, real-time spam message detection). Interested friends can give it a try as well.
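For a rough sense of how a custom Core ML model plugs into the Vision pipeline, here is a minimal sketch using the pre-iOS 18 VNCoreMLRequest (the iOS 18 counterpart, CoreMLRequest, appears in the request table later in this article). The model class MyClassifier is a hypothetical placeholder for whatever model you have trained.

```swift
import Vision
import CoreML

// Minimal sketch: run a (hypothetical) Core ML classifier through Vision.
func classify(imageURL: URL) throws {
    // "MyClassifier" stands in for your own compiled Core ML model class.
    let mlModel = try MyClassifier(configuration: MLModelConfiguration()).model
    let visionModel = try VNCoreMLModel(for: mlModel)

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        guard let observations = request.results as? [VNClassificationObservation] else { return }
        observations.prefix(3).forEach { print("\($0.identifier): \($0.confidence)") }
    }

    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])
}
```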

p.s.

Vision v.s. VisionKit:

Vision: Mainly used for image analysis tasks such as face recognition, barcode detection, text recognition, etc. It provides powerful APIs to handle and analyze visual content in static images or videos.

VisionKit: Specifically designed for tasks related to document scanning. It offers a scanner view controller that can be used to scan documents and generate high-quality PDFs or images.
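As a point of comparison, below is a minimal sketch of VisionKit's document scanner, assuming a plain UIKit view controller; the cancel/failure delegate callbacks and error handling are omitted.

```swift
import UIKit
import VisionKit

// Minimal sketch: present VisionKit's document scanner and collect the scanned pages.
final class ScannerViewController: UIViewController, VNDocumentCameraViewControllerDelegate {
    func startScanning() {
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        present(scanner, animated: true)
    }

    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFinishWith scan: VNDocumentCameraScan) {
        // Each scanned page comes back as a UIImage; these could then be fed into Vision (e.g. OCR).
        let pages = (0..<scan.pageCount).map { scan.imageOfPage(at: $0) }
        print("Scanned \(pages.count) page(s)")
        controller.dismiss(animated: true)
    }
}
```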

On my M1 Mac, the Vision framework cannot run in the simulator and can only be tested on a physical device; running in the simulator throws a “Could not create Espresso context” error, and I found no solution in the official forum discussions.

Since I don’t have a physical device running iOS 18 for testing, all execution results in this article are based on the old (pre-iOS 18) syntax; please leave a comment if you find errors in the new syntax.

WWDC 2024 — Discover Swift enhancements in the Vision framework


This article is a sharing note for WWDC 24 — Discover Swift enhancements in the Vision framework session, along with some experimental insights.

Introduction — Vision framework Features

Face recognition, contour recognition

Text recognition in image content

As of iOS 18, it supports 18 languages.

// Supported language list
if #available(iOS 18.0, *) {
  print(RecognizeTextRequest().supportedRecognitionLanguages.map { "\($0.languageCode!)-\(($0.region?.identifier ?? $0.script?.identifier)!)" })
} else {
  print(try! VNRecognizeTextRequest().supportedRecognitionLanguages())
}

// The actual available recognition languages are based on this.
// Tested on iOS 18, the output is as follows:
// ["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant", "yue-Hans", "yue-Hant", "ko-KR", "ja-JP", "ru-RU", "uk-UA", "th-TH", "vi-VT", "ar-SA", "ars-SA"]
// Swedish language mentioned in WWDC was not seen, unsure if it has not been released yet or is related to device region and language settings

Dynamic motion capture

  • Can achieve dynamic capture of people and objects
  • Gesture capture implements air signature function

What’s new in Vision? (iOS 18) — Image rating feature (quality, key points)

  • Calculate scores for input images to easily filter out high-quality photos
  • The scoring considers multiple dimensions: not just image quality, but also lighting, angle, subject, whether the shot has memorable points, and so on

WWDC used the three images above (all of the same technical quality) to explain the scoring:

  • High-scoring image: composition, lighting, memorable points
  • Low-scoring image: no main subject; looks like it was taken casually or accidentally
  • Utility image: technically well-taken but lacks memorable points, like images used for stock photo libraries

iOS ≥ 18 New API: CalculateImageAestheticsScoresRequest

let request = CalculateImageAestheticsScoresRequest()
let result = try await request.perform(on: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*yL3vI1ADzwlovctW5WQgJw.jpeg")!)

// Photo score
print(result.overallScore)

// Whether it is judged as a utility image
print(result.isUtility)

What’s new in Vision? (iOS 18) — Simultaneous detection of body and gesture poses

In the past, only body pose and hand pose could be detected separately.

With this update, developers can detect both body and hand poses simultaneously, combining them into a single request and result, making it more convenient for further feature development.

iOS ≥ 18 New API: DetectHumanBodyPoseRequest

var request = DetectHumanBodyPoseRequest()
// Detect hand pose together
request.detectsHands = true

guard let bodyPose = try await request.perform(on: image).first else { return }

// Body Pose Joints
let bodyJoints = bodyPose.allJoints()
// Left hand Pose Joints
let leftHandJoints = bodyPose.leftHand.allJoints()
// Right hand Pose Joints
let rightHandJoints = bodyPose.rightHand.allJoints()

New Vision API

In this update, Apple provides a new Swift-native Vision API. Beyond covering the existing functionality, it focuses on Swift 6 / Swift Concurrency support, offering more efficient and more Swift-like ways to work with the framework.

Get started with Vision

The speaker here reintroduced the basic usage of the Vision framework. Apple has encapsulated 31 types of common image recognition requests and their corresponding “Observation” objects (as of iOS 18).

  1. Request: DetectFaceRectanglesRequest (face region detection) → Result: FaceObservation. The previous article “Exploring Vision - Automatically Identify Faces for Avatar Upload in Apps (Swift)” used this request/observation pair.

  2. Request: RecognizeTextRequest (text recognition) → Result: RecognizedTextObservation

  3. Request: GenerateObjectnessBasedSaliencyImageRequest (objectness-based saliency) → Result: SaliencyImageObservation

All 31 types of requests:

| Request | Purpose | Observation | Description |
| --- | --- | --- | --- |
| CalculateImageAestheticsScoresRequest | Calculate the aesthetic score of the image. | AestheticsObservation | Returns the aesthetic score of the image, considering factors like composition and color. |
| ClassifyImageRequest | Classify the content of the image. | ClassificationObservation | Returns the classification labels and confidence of objects or scenes in the image. |
| CoreMLRequest | Analyze images using Core ML models. | CoreMLFeatureValueObservation | Generates observations based on the output of Core ML models. |
| DetectAnimalBodyPoseRequest | Detect animal poses in images. | RecognizedPointsObservation | Returns the skeleton points of animals and their positions. |
| DetectBarcodesRequest | Detect barcodes in images. | BarcodeObservation | Returns barcode data and types (e.g., QR code). |
| DetectContoursRequest | Detect contours in images. | ContoursObservation | Returns detected contour lines in the image. |
| DetectDocumentSegmentationRequest | Detect and segment documents in images. | RectangleObservation | Returns the rectangular boundary positions of documents. |
| DetectFaceCaptureQualityRequest | Evaluate the quality of face captures. | FaceObservation | Returns quality assessment scores for facial images. |
| DetectFaceLandmarksRequest | Detect facial landmarks. | FaceObservation | Returns detailed positions of facial landmarks (e.g., eyes, nose). |
| DetectFaceRectanglesRequest | Detect faces in images. | FaceObservation | Returns the bounding box positions of faces. |
| DetectHorizonRequest | Detect horizons in images. | HorizonObservation | Returns the angle and position of the horizon. |
| DetectHumanBodyPose3DRequest | Detect 3D human body poses in images. | RecognizedPointsObservation | Returns 3D human skeleton points and their spatial coordinates. |
| DetectHumanBodyPoseRequest | Detect human body poses in images. | RecognizedPointsObservation | Returns human skeleton points and their coordinates. |
| DetectHumanHandPoseRequest | Detect hand poses in images. | RecognizedPointsObservation | Returns hand skeleton points and their positions. |
| DetectHumanRectanglesRequest | Detect humans in images. | HumanObservation | Returns the bounding box positions of humans. |
| DetectRectanglesRequest | Detect rectangles in images. | RectangleObservation | Returns the coordinates of the four vertices of rectangles. |
| DetectTextRectanglesRequest | Detect text regions in images. | TextObservation | Returns the positions and bounding boxes of text regions. |
| DetectTrajectoriesRequest | Detect and analyze object motion trajectories. | TrajectoryObservation | Returns motion trajectory points and their time series. |
| GenerateAttentionBasedSaliencyImageRequest | Generate attention-based saliency images. | SaliencyImageObservation | Returns saliency maps of the most attention-grabbing areas in the image. |
| GenerateForegroundInstanceMaskRequest | Generate foreground instance mask images. | InstanceMaskObservation | Returns masks of foreground objects. |
| GenerateImageFeaturePrintRequest | Generate image feature prints for comparison. | FeaturePrintObservation | Returns feature fingerprint data of images for similarity comparison. |
| GenerateObjectnessBasedSaliencyImageRequest | Generate objectness-based saliency images. | SaliencyImageObservation | Returns saliency maps of salient object areas. |
| GeneratePersonInstanceMaskRequest | Generate person instance mask images. | InstanceMaskObservation | Returns masks of person instances. |
| GeneratePersonSegmentationRequest | Generate person segmentation images. | SegmentationObservation | Returns binary images of person segmentation. |
| RecognizeAnimalsRequest | Detect and identify animals in images. | RecognizedObjectObservation | Returns animal types and their confidence levels. |
| RecognizeTextRequest | Detect and recognize text in images. | RecognizedTextObservation | Returns detected text content and its spatial positions. |
| TrackHomographicImageRegistrationRequest | Track homographic image registration. | ImageAlignmentObservation | Returns homographic transformation matrices between images for image registration. |
| TrackObjectRequest | Track objects in images. | DetectedObjectObservation | Returns the positions and velocity information of objects in images. |
| TrackOpticalFlowRequest | Track optical flow in images. | OpticalFlowObservation | Returns optical flow vector fields describing pixel movements. |
| TrackRectangleRequest | Track rectangles in images. | RectangleObservation | Returns the positions, sizes, and rotation angles of rectangles in images. |
| TrackTranslationalImageRegistrationRequest | Track translational image registration. | ImageAlignmentObservation | Returns translational transformation matrices between images for image registration. |
  • Prefix the names with VN for the old (pre-iOS 18) API, e.g. VNDetectBarcodesRequest.

The speaker mentioned several commonly used Requests as follows.

ClassifyImageRequest

Recognize the input image, obtain label classification and confidence.

[Travelogue] 2024 Second Visit to Kyushu 9-Day Free and Easy Trip, Entering Fukuoka by Busan→Hakata Cruise

if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = ClassifyImageRequest()
    Task {
        do {
            let observations = try await request.perform(on: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*yL3vI1ADzwlovctW5WQgJw.jpeg")!)
            observations.forEach {
                observation in
                print("\(observation.identifier): \(observation.confidence)")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old method
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNClassificationObservation] else {
            return
        }
        observations.forEach {
            observation in
            print("\(observation.identifier): \(observation.confidence)")
        }
    }

    let request = VNClassifyImageRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*3_jdrLurFuUfNdW4BJaRww.jpeg")!, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

  outdoor: 0.75392926
  sky: 0.75392926
  blue_sky: 0.7519531
  machine: 0.6958008
  cloudy: 0.26538086
  structure: 0.15728651
  sign: 0.14224191
  fence: 0.118652344
  banner: 0.0793457
  material: 0.075975396
  plant: 0.054406323
  foliage: 0.05029297
  light: 0.048126098
  lamppost: 0.048095703
  billboards: 0.040039062
  art: 0.03977703
  branch: 0.03930664
  decoration: 0.036868922
  flag: 0.036865234
....etc

RecognizeTextRequest

Recognize the text content in an image (a.k.a. OCR).

[Travelogue] 2023 Tokyo 5-day free trip

if #available(iOS 18.0, *) {
    // New API using Swift features
    var request = RecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = [.init(identifier: "ja-JP"), .init(identifier: "en-US")] // Specify the recognition language codes
    Task {
        do {
            let observations = try await request.perform(on: URL(string: "https://zhgchg.li/assets/9da2c51fa4f2/1*fBbNbDepYioQ-3-0XUkF6Q.jpeg")!)
            observations.forEach {
                observation in
                let topCandidate = observation.topCandidates(1).first
                print(topCandidate?.string ?? "No text recognized")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old way
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedTextObservation] else {
            return
        }
        observations.forEach {
            observation in
            let topCandidate = observation.topCandidates(1).first
            print(topCandidate?.string ?? "No text recognized")
        }
    }

    let request = VNRecognizeTextRequest(completionHandler: completionHandler)
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["ja-JP", "en-US"] // Specify the recognition language codes
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: URL(string: "https://zhgchg.li/assets/9da2c51fa4f2/1*fBbNbDepYioQ-3-0XUkF6Q.jpeg")!, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Result:

LE LABO Aoyama Store
TEL:03-6419-7167
*Thank you for your purchase*
No: 21347
Date: 2023/06/10 14.14.57
Responsible:
1690370
Register: 008A 1
Product Name
Tax-inclusive Price Quantity Tax-inclusive Total
Kaiak 10 EDP FB 15ML
J1P7010000S
16,800
16,800
Another 13 EDP FB 15ML
J1PJ010000S
10,700
10,700
Lip Balm 15ML
JOWC010000S
2,000
1
Total Amount
(Tax Included)
CARD
2,000
3 items purchased
29,500
0
29,500
29,500

DetectBarcodesRequest

Detect barcode and QR code data in the image.

Thai locals recommend Goose Brand Cooling Gel

let filePath = Bundle.main.path(forResource: "IMG_6777", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = DetectBarcodesRequest()
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            observations.forEach {
                observation in
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old way
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        observations.forEach {
            observation in
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let request = VNDetectBarcodesRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

Payload: 8859126000911
Symbology: VNBarcodeSymbologyEAN13
Payload: https://lin.ee/hGynbVM
Symbology: VNBarcodeSymbologyQR
Payload: http://www.hongthaipanich.com/
Symbology: VNBarcodeSymbologyQR
Payload: https://www.facebook.com/qr?id=100063856061714
Symbology: VNBarcodeSymbologyQR

RecognizeAnimalsRequest

Recognize animals in the image with confidence.

[meme Source](https://www.redbubble.com/i/canvas-print/Funny-AI-Woman-yelling-at-a-cat-meme-design-Machine-learning-by-omolog/43039298.5Y5V7)

let filePath = Bundle.main.path(forResource: "IMG_5026", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    let request = RecognizeAnimalsRequest()
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            observations.forEach {
                observation in
                let labels = observation.labels
                labels.forEach {
                    label in
                    print("Detected animal: \(label.identifier) with confidence: \(label.confidence)")
                }
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old way
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedObjectObservation] else {
            return
        }
        observations.forEach {
            observation in
            let labels = observation.labels
            labels.forEach {
                label in
                print("Detected animal: \(label.identifier) with confidence: \(label.confidence)")
            }
        }
    }

    let request = VNRecognizeAnimalsRequest(completionHandler: completionHandler)
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Analysis Results:

Detected animal: Cat with confidence: 0.77245045

Others:

  • Detecting human bodies in images: DetectHumanRectanglesRequest (a minimal sketch follows this list)
  • Detecting poses of animals and humans (3D or 2D): DetectAnimalBodyPoseRequest, DetectHumanBodyPose3DRequest, DetectHumanBodyPoseRequest, DetectHumanHandPoseRequest
  • Detecting and tracking object motion trajectories (across video or animation frames): DetectTrajectoriesRequest, TrackObjectRequest, TrackRectangleRequest
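These follow the same pattern as the examples above. As an illustration, here is a minimal sketch of DetectHumanRectanglesRequest using the iOS 18 syntax (untested, like the rest of the new-syntax code in this article), reusing the local fileURL from the previous examples:

```swift
if #available(iOS 18.0, *) {
    let request = DetectHumanRectanglesRequest()
    Task {
        do {
            // One HumanObservation per detected person
            let observations = try await request.perform(on: fileURL)
            observations.forEach { observation in
                print("Human boundingBox: \(observation.boundingBox)")
            }
        } catch {
            print("Request failed: \(error)")
        }
    }
}
```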

iOS ≥ 18 Update Highlight:

VN*Request -> *Request (e.g. VNDetectBarcodesRequest -> DetectBarcodesRequest)
VN*Observation -> *Observation (e.g. VNRecognizedObjectObservation -> RecognizedObjectObservation)
VNRequestCompletionHandler -> async/await
VNImageRequestHandler.perform([VN*Request]) -> *Request.perform()

WWDC Example

The official WWDC video uses a supermarket product scanner as an example.

Most products have a Barcode that can be scanned

We can obtain the barcode’s location from observation.boundingBox; note that, unlike the usual UIView coordinate system, the bounding box is normalized (values from 0 to 1) with its origin at the lower-left corner.

let filePath = Bundle.main.path(forResource: "IMG_6785", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    var request = DetectBarcodesRequest()
    request.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
    Task {
        do {
            let observations = try await request.perform(on: fileURL)
            if let observation = observations.first {
                DispatchQueue.main.async {
                    self.infoLabel.text = observation.payloadString
                    // Color layer marking
                    let colorLayer = CALayer()
                    // iOS >=18 new coordinate transformation API toImageCoordinates
                    // Not tested, may need to calculate the offset for ContentMode = AspectFit:
                    colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                    colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                    self.baseImageView.layer.addSublayer(colorLayer)
                }
                print("BoundingBox: \(observation.boundingBox.cgRect)")
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old approach
    let completionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        if let observation = observations.first {
            DispatchQueue.main.async {
                self.infoLabel.text = observation.payloadStringValue
                // Color layer marking
                let colorLayer = CALayer()
                colorLayer.frame = self.convertBoundingBox(observation.boundingBox, to: self.baseImageView)
                colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                self.baseImageView.layer.addSublayer(colorLayer)
            }
            print("BoundingBox: \(observation.boundingBox)")
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let request = VNDetectBarcodesRequest(completionHandler: completionHandler)
    request.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([request])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

iOS ≥ 18 Update Highlight:

// iOS ≥18 New Coordinate Transformation API toImageCoordinates
observation.boundingBox.toImageCoordinates(CGSize, origin: .upperLeft)
// https://developer.apple.com/documentation/vision/normalizedpoint/toimagecoordinates(from:imagesize:origin:)

Helper:

// Generated by ChatGPT 4o
// Since the photo in the ImageView is set with ContentMode = AspectFit
// Extra calculation is needed for the top and bottom offset caused by Fit
func convertBoundingBox(_ boundingBox: CGRect, to view: UIImageView) -> CGRect {
    guard let image = view.image else {
        return .zero
    }

    let imageSize = image.size
    let viewSize = view.bounds.size
    let imageRatio = imageSize.width / imageSize.height
    let viewRatio = viewSize.width / viewSize.height
    var scaleFactor: CGFloat
    var offsetX: CGFloat = 0
    var offsetY: CGFloat = 0
    if imageRatio > viewRatio {
        // Image fits in the width direction
        scaleFactor = viewSize.width / imageSize.width
        offsetY = (viewSize.height - imageSize.height * scaleFactor) / 2
    } else {
        // Image fits in the height direction
        scaleFactor = viewSize.height / imageSize.height
        offsetX = (viewSize.width - imageSize.width * scaleFactor) / 2
    }

    let x = boundingBox.minX * imageSize.width * scaleFactor + offsetX
    let y = (1 - boundingBox.maxY) * imageSize.height * scaleFactor + offsetY
    let width = boundingBox.width * imageSize.width * scaleFactor
    let height = boundingBox.height * imageSize.height * scaleFactor
    return CGRect(x: x, y: y, width: width, height: height)
}

Output:

BoundingBox: (0.5295758928571429, 0.21408638121589782, 0.0943080357142857, 0.21254415360708087)
Payload: 4710018183805
Symbology: VNBarcodeSymbologyEAN13

Some products do not have a barcode, such as loose fruits with only product labels

Therefore, our scanner also needs to support scanning pure text labels simultaneously.

let filePath = Bundle.main.path(forResource: "apple", ofType: "jpg")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
    // New API using Swift features
    var barcodesRequest = DetectBarcodesRequest()
    barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
    var textRequest = RecognizeTextRequest()
    textRequest.recognitionLanguages = [.init(identifier: "zh-Hant"), .init(identifier: "en-US")]
    Task {
        do {
            let handler = ImageRequestHandler(fileURL)
            // perform() uses parameter pack syntax; all requests must finish before any of their results can be used, e.g.:
            // let (barcodesObservation, textObservation, ...) = try await handler.perform(barcodesRequest, textRequest, ...)
            let (barcodesObservation, textObservation) = try await handler.perform(barcodesRequest, textRequest)
            if let observation = barcodesObservation.first {
                DispatchQueue.main.async {
                    self.infoLabel.text = observation.payloadString
                    // Color layer
                    let colorLayer = CALayer()
                    // New Coordinate Transformation API toImageCoordinates for iOS >=18
                    // Not tested, may need to consider the offset of ContentMode = AspectFit:
                    colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                    colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                    self.baseImageView.layer.addSublayer(colorLayer)
                }
                print("BoundingBox: \(observation.boundingBox.cgRect)")
                print("Payload: \(observation.payloadString ?? "No payload")")
                print("Symbology: \(observation.symbology)")
            }
            textObservation.forEach {
                observation in
                let topCandidate = observation.topCandidates(1).first
                print(topCandidate?.string ?? "No text recognized")
            }
        }
        catch {
            print("Request failed: \(error)")
        }
    }
} else {
    // Old approach
    let barcodesCompletionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNBarcodeObservation] else {
            return
        }
        if let observation = observations.first {
            DispatchQueue.main.async {
                self.infoLabel.text = observation.payloadStringValue
                // Color layer
                let colorLayer = CALayer()
                colorLayer.frame = self.convertBoundingBox(observation.boundingBox, to: self.baseImageView)
                colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                self.baseImageView.layer.addSublayer(colorLayer)
            }
            print("BoundingBox: \(observation.boundingBox)")
            print("Payload: \(observation.payloadStringValue ?? "No payload")")
            print("Symbology: \(observation.symbology.rawValue)")
        }
    }

    let textCompletionHandler: VNRequestCompletionHandler = {
        request, error in
        guard error == nil else {
            print("Request failed: \(String(describing: error))")
            return
        }
        guard let observations = request.results as? [VNRecognizedTextObservation] else {
            return
        }
        observations.forEach {
            observation in
            let topCandidate = observation.topCandidates(1).first
            print(topCandidate?.string ?? "No text recognized")
        }
    }

    let barcodesRequest = VNDetectBarcodesRequest(completionHandler: barcodesCompletionHandler)
    barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
    let textRequest = VNRecognizeTextRequest(completionHandler: textCompletionHandler)
    textRequest.recognitionLevel = .accurate
    textRequest.recognitionLanguages = ["en-US"]
    DispatchQueue.global().async {
        let handler = VNImageRequestHandler(url: fileURL, options: [:])
        do {
            try handler.perform([barcodesRequest, textRequest])
        }
        catch {
            print("Request failed: \(error)")
        }
    }
}

Output:

94128s
ORGANIC
Pink Lady®
Produce of USh

iOS ≥ 18 Update Highlight:

let handler = ImageRequestHandler(fileURL)
// perform() uses parameter pack syntax; all requests must finish before any of their results can be used, e.g.:
// let (barcodesObservation, textObservation, ...) = try await handler.perform(barcodesRequest, textRequest, ...)
let (barcodesObservation, textObservation) = try await handler.perform(barcodesRequest, textRequest)

iOS ≥ 18 performAll() method

The perform(barcodesRequest, textRequest) call above waits for both requests (barcode scanning and text recognition) to finish before execution continues. Starting from iOS 18, a new performAll() method returns results as a stream, so each request can be handled as soon as it completes, for example responding immediately once a barcode is scanned.

if #available(iOS 18.0, *) {
    // New API using Swift features
    var barcodesRequest = DetectBarcodesRequest()
    barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcodes is needed, it can be specified directly to improve performance
    var textRequest = RecognizeTextRequest()
    textRequest.recognitionLanguages = [.init(identifier: "zh-Hant"), .init(identifier: "en-US")]
    Task {
        let handler = ImageRequestHandler(fileURL)
        let observation = handler.performAll([barcodesRequest, textRequest] as [any VisionRequest])
        for try await result in observation {
            switch result {
                case .detectBarcodes(_, let barcodesObservation):
                if let observation = barcodesObservation.first {
                    DispatchQueue.main.async {
                        self.infoLabel.text = observation.payloadString
                        // Color layer marking
                        let colorLayer = CALayer()
                        // iOS >=18 new coordinate transformation API toImageCoordinates
                        // Not tested, may still need to calculate the offset for ContentMode = AspectFit:
                        colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
                        colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
                        self.baseImageView.layer.addSublayer(colorLayer)
                    }
                    print("BoundingBox: \(observation.boundingBox.cgRect)")
                    print("Payload: \(observation.payloadString ?? "No payload")")
                    print("Symbology: \(observation.symbology)")
                }
                case .recognizeText(_, let textObservation):
                textObservation.forEach {
                    observation in
                    let topCandidate = observation.topCandidates(1).first
                    print(topCandidate?.string ?? "No text recognized")
                }
                default:
                print("Unrecognized result: \(result)")
            }
        }
    }
}

Optimize with Swift Concurrency

Suppose we have an image-wall list where each image needs its main object automatically cropped out; this is where we can leverage Swift Concurrency to improve loading efficiency.

Original Implementation

func generateThumbnail(url: URL) async throws -> UIImage {
  let request = GenerateAttentionBasedSaliencyImageRequest()
  let saliencyObservation = try await request.perform(on: url)
  return cropImage(url, to: saliencyObservation.salientObjects)
}
    
func generateAllThumbnails() async throws {
  for image in images {
    image.thumbnail = try await generateThumbnail(url: image.url)
  }
}

Thumbnails are generated one at a time, which is slow and inefficient.

Optimization (1) — TaskGroup Concurrency

func generateAllThumbnails() async throws {
  try await withThrowingDiscardingTaskGroup { taskGroup in
    for image in images {
      taskGroup.addTask {
        image.thumbnail = try await generateThumbnail(url: image.url)
      }
    }
  }
}

Each task is added to the TaskGroup for concurrent execution.

Issue: image recognition and cropping are memory-intensive operations. Unbounded parallelism may cause UI lag and OOM crashes.

Optimization (2) — TaskGroup Concurrency + Limiting Parallelism

func generateAllThumbnails() async throws {
    try await withThrowingDiscardingTaskGroup {
        taskGroup in
        // Run at most 5 tasks concurrently
        let maxImageTasks = min(5, images.count)
        // Fill in 5 tasks first
        for index in 0..<maxImageTasks {
            taskGroup.addTask {
                images[index].thumbnail = try await generateThumbnail(url: images[index].url)
            }
        }
        var nextIndex = maxImageTasks
        for try await _ in taskGroup {
            // Each time a task in the taskGroup completes...
            // check whether the index has reached the end of the list
            if nextIndex < images.count {
                let image = images[nextIndex]
                // Continue filling tasks one by one (maintaining at most 5)
                taskGroup.addTask {
                    image.thumbnail = try await generateThumbnail(url: image.url)
                }
                nextIndex += 1
            }
        }
    }
}

Update an existing Vision app

  1. Vision will remove CPU and GPU support for some requests on devices that have a Neural Engine; on those devices, the Neural Engine is the best choice for performance. You can check what is supported with the supportedComputeDevices() API (a minimal sketch follows this list).
  2. Remove the VN prefixes: VNXXXRequest, VNXXXObservation -> XXXRequest, XXXObservation.
  3. Replace the original VNRequestCompletionHandler with async/await.
  4. Call *Request.perform() directly instead of VNImageRequestHandler.perform([VN*Request]).
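Below is a minimal sketch of checking the supported compute devices. It uses the pre-iOS 18 supportedComputeStageDevices() API on VNRequest (iOS 17+), since that is what I could verify; the supportedComputeDevices() call mentioned in the session is the new-API counterpart.

```swift
import Vision

// Minimal sketch (pre-iOS 18 API, iOS 17+): inspect which compute devices each
// processing stage of a request can run on before relying on CPU/GPU fallbacks.
let request = VNClassifyImageRequest()
do {
    let devicesByStage = try request.supportedComputeStageDevices()
    for (stage, devices) in devicesByStage {
        // devices contains MLComputeDevice values such as CPU, GPU, or the Neural Engine
        print("Stage \(stage): \(devices)")
    }
} catch {
    print("Query failed: \(error)")
}
```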

Wrap-up

  • API designed for Swift language features
  • New features and methods are Swift Only, available for iOS ≥ 18
  • New image scoring feature, body + hand movement tracking

Thanks!

KKday Business Recruitment

👉👉👉 This sharing is derived from the weekly technical sharing sessions within the KKday App Team. The team is currently recruiting a Senior iOS Engineer; interested friends are welcome to submit a resume. 👈👈👈

Reference

Discover Swift enhancements in the Vision framework

The Vision Framework API has been redesigned to leverage modern Swift features like concurrency, making it easier and faster to integrate a wide array of Vision algorithms into your app. We’ll tour the updated API and share sample code, along with best practices, to help you get the benefits of this framework with less coding effort. We’ll also demonstrate two new features: image aesthetics and holistic body pose.


Vision framework Apple Developer Documentation

-

Feel free to contact me for any questions or feedback.

===

Chinese version of this article

===

This article was first published in Traditional Chinese on Medium ➡️ View Here



This post is licensed under CC BY 4.0 by the author.
