iOS Vision framework x WWDC 24 Discover Swift enhancements in the Vision framework Session
Vision framework review & trying out new Swift API in iOS 18
Photo by BoliviaInteligente
Topic
Its relationship to Vision Pro is like the relationship between hot dogs and dogs: completely unrelated.
Vision framework
The Vision framework is Apple’s built-in machine learning image recognition framework, which lets developers implement common image recognition features easily and quickly. Vision was introduced as early as iOS 11.0 (2017, the iPhone 8 era) and has been continuously iterated and optimized since. Starting with iOS 18.0 it also ships a new Swift-native Vision API that integrates with Swift Concurrency to maximize performance.
Features of Vision framework
- Built-in numerous image recognition and motion tracking methods (up to 31 as of iOS 18)
- On-Device computation using only the phone’s chip, independent of cloud services, fast and secure
- Simple and easy-to-use API
- Supported on all Apple platforms: iOS 11.0+, iPadOS 11.0+, Mac Catalyst 13.0+, macOS 10.13+, tvOS 11.0+, visionOS 1.0+
- Released for multiple years (2017-present) and continuously updated
- Enhances computational performance by integrating Swift language features
I played with this about 6 years ago: Exploring Vision - Automatically Recognizing Faces for App Avatar Cropping (Swift)
This time, following the WWDC 24 “Discover Swift enhancements in the Vision framework” session, I’m revisiting it and trying out the new Swift features.
CoreML
Apple also has another framework called Core ML, an on-device machine learning framework. It lets you train models for the objects or documents you want to recognize and then use those models directly in your app (e.g. real-time article classification, real-time spam message detection, and so on). Interested readers can give it a try.
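If you already have a trained Core ML model, it can be plugged into the same Vision pipeline via VNCoreMLRequest. Below is a minimal sketch using the pre-iOS 18 API; "MyClassifier" is a hypothetical compiled Core ML model class bundled with the app, not something from the session:

```swift
import Vision
import CoreML

// Minimal sketch (pre-iOS 18 API); "MyClassifier" is a hypothetical bundled Core ML model class.
func classify(imageURL: URL) throws {
    let mlModel = try MyClassifier(configuration: MLModelConfiguration()).model
    let visionModel = try VNCoreMLModel(for: mlModel)
    let request = VNCoreMLRequest(model: visionModel) { request, error in
        guard error == nil,
              let observations = request.results as? [VNClassificationObservation] else { return }
        observations.prefix(3).forEach { print("\($0.identifier): \($0.confidence)") }
    }
    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    try handler.perform([request])
}
```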
p.s.
Vision v.s. VisionKit:
Vision: Mainly used for image analysis tasks such as face recognition, barcode detection, text recognition, etc. It provides powerful APIs to handle and analyze visual content in static images or videos.
VisionKit: Specifically designed for tasks related to document scanning. It offers a scanner view controller that can be used to scan documents and generate high-quality PDFs or images.
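To make the distinction concrete, here is a minimal sketch of the VisionKit document scanner (VNDocumentCameraViewController); the presenting view controller around it is just illustrative scaffolding:

```swift
import UIKit
import VisionKit

// Minimal sketch: present the VisionKit document scanner and receive the scanned pages.
final class ScannerViewController: UIViewController, VNDocumentCameraViewControllerDelegate {
    func presentScanner() {
        guard VNDocumentCameraViewController.isSupported else { return }
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        present(scanner, animated: true)
    }

    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFinishWith scan: VNDocumentCameraScan) {
        // Each scanned page comes back as a UIImage that can be saved or rendered into a PDF.
        for pageIndex in 0..<scan.pageCount {
            let pageImage = scan.imageOfPage(at: pageIndex)
            print("Scanned page \(pageIndex): \(pageImage.size)")
        }
        controller.dismiss(animated: true)
    }
}
```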
The Vision framework cannot run in the simulator on Apple Silicon (M1) Macs; it can only be tested on a physical device. Running in a simulator environment throws a Could not create Espresso context error, and no solution was found in the official forum discussion.
Since I don’t have a physical device running iOS 18 for testing, all the execution results in this article are based on the old (pre-iOS 18) syntax; please leave a comment if anything is wrong with the new syntax.
WWDC 2024 — Discover Swift enhancements in the Vision framework
Discover Swift enhancements in the Vision framework
This article is a sharing note for WWDC 24 — Discover Swift enhancements in the Vision framework session, along with some experimental insights.
Introduction — Vision framework Features
Face recognition, contour recognition
Text recognition in image content
As of iOS 18, it supports 18 languages.
// Supported language list
if #available(iOS 18.0, *) {
print(RecognizeTextRequest().supportedRecognitionLanguages.map { "\($0.languageCode!)-\(($0.region?.identifier ?? $0.script?.identifier)!)" })
} else {
print(try! VNRecognizeTextRequest().supportedRecognitionLanguages())
}
// The actual available recognition languages are based on this.
// Tested on iOS 18, the output is as follows:
// ["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant", "yue-Hans", "yue-Hant", "ko-KR", "ja-JP", "ru-RU", "uk-UA", "th-TH", "vi-VT", "ar-SA", "ars-SA"]
// Swedish language mentioned in WWDC was not seen, unsure if it has not been released yet or is related to device region and language settings
Dynamic motion capture
- Can dynamically capture the motion of people and objects
- Gesture capture can be used to implement an air-signature feature (see the sketch below)
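As a rough illustration of the gesture-capture idea, the sketch below (pre-iOS 18 API) extracts the index fingertip position from a single video frame; collecting these points frame by frame is the basis of an air-signature feature. The confidence threshold and the surrounding frame handling are my own assumptions, not from the session:

```swift
import Vision

// Minimal sketch (pre-iOS 18 API): find the index fingertip in one video frame.
func indexFingertip(in pixelBuffer: CVPixelBuffer) throws -> CGPoint? {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 1
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try handler.perform([request])
    guard let observation = request.results?.first else { return nil }
    let tip = try observation.recognizedPoint(.indexTip)
    guard tip.confidence > 0.3 else { return nil }
    // Normalized coordinates (0...1, origin at the lower-left corner);
    // append successive points to a path to draw the "signature".
    return tip.location
}
```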
What’s new in Vision? (iOS 18) — Image scoring feature (quality, memorable points)
- Calculate scores for input images to easily filter out high-quality photos
- The scoring considers multiple dimensions: not only image quality but also lighting, angle, subject, whether there are memorable points, and so on
WWDC provided the above three images for explanation (under the same image quality), which are:
- High-scoring image: composition, lighting, memorable points
- Low-scoring image: no main subject, looks like it was taken casually or accidentally
- Utility image: technically well-taken but lacks memorable points, like images used for stock photo libraries
iOS ≥ 18 New API: CalculateImageAestheticsScoresRequest
let request = CalculateImageAestheticsScoresRequest()
let result = try await request.perform(on: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*yL3vI1ADzwlovctW5WQgJw.jpeg")!)
// Photo score
print(result.overallScore)
// Whether it is judged as a utility image
print(result.isUtility)
What’s new in Vision? (iOS 18) — Simultaneous detection of body and gesture poses
In the past, only body pose and hand pose could be detected separately.
With this update, developers can detect both body and hand poses simultaneously, combining them into a single request and result, making it more convenient for further feature development.
iOS ≥ 18 New API: DetectHumanBodyPoseRequest
var request = DetectHumanBodyPoseRequest()
// Detect hand pose together
request.detectsHands = true
guard let bodyPose = try await request.perform(on: image).first else { return }
// Body Pose Joints
let bodyJoints = bodyPose.allJoints()
// Left hand Pose Joints
let leftHandJoints = bodyPose.leftHand.allJoints()
// Right hand Pose Joints
let rightHandJoints = bodyPose.rightHand.allJoints()
New Vision API
In this update, Apple provides a new Swift-wrapped Vision API. Besides covering the existing functionality, it focuses on Swift 6 / Swift Concurrency support, offering a more efficient and more Swift-like way to work with the framework.
Get started with Vision
The speaker here reintroduced the basic usage of the Vision framework. Apple has encapsulated 31 types of common image recognition requests and their corresponding “Observation” objects (as of iOS 18).
- Request: DetectFaceRectanglesRequest (face region detection) → Result: FaceObservation. The previous article “Exploring Vision - Automatically Identify Faces for Avatar Upload in Apps (Swift)” used this pair.
- Request: RecognizeTextRequest (text recognition) → Result: RecognizedTextObservation
- Request: GenerateObjectnessBasedSaliencyImageRequest (objectness-based saliency) → Result: SaliencyImageObservation
All 31 types of requests:
Request Purpose | Observation Description |
---|---|
CalculateImageAestheticsScoresRequest Calculate the aesthetic score of the image. | AestheticsObservation Returns the aesthetic score of the image, considering factors like composition and color. |
ClassifyImageRequest Classify the content of the image. | ClassificationObservation Returns the classification labels and confidence of objects or scenes in the image. |
CoreMLRequest Analyze images using Core ML models. | CoreMLFeatureValueObservation Generates observations based on the output of Core ML models. |
DetectAnimalBodyPoseRequest Detect animal poses in images. | RecognizedPointsObservation Returns the skeleton points and their positions of animals. |
DetectBarcodesRequest Detect barcodes in images. | BarcodeObservation Returns barcode data and types (e.g., QR code). |
DetectContoursRequest Detect contours in images. | ContoursObservation Returns detected contour lines in the image. |
DetectDocumentSegmentationRequest Detect and segment documents in images. | RectangleObservation Returns the rectangular boundary positions of documents. |
DetectFaceCaptureQualityRequest Evaluate the quality of face captures. | FaceObservation Returns quality assessment scores for facial images. |
DetectFaceLandmarksRequest Detect facial landmarks. | FaceObservation Returns detailed positions of facial landmarks (e.g., eyes, nose). |
DetectFaceRectanglesRequest Detect faces in images. | FaceObservation Returns the bounding box positions of faces. |
DetectHorizonRequest Detect horizons in images. | HorizonObservation Returns the angle and position of the horizon. |
DetectHumanBodyPose3DRequest Detect 3D human body poses in images. | RecognizedPointsObservation Returns 3D human skeleton points and their spatial coordinates. |
DetectHumanBodyPoseRequest Detect human body poses in images. | RecognizedPointsObservation Returns human skeleton points and their coordinates. |
DetectHumanHandPoseRequest Detect hand poses in images. | RecognizedPointsObservation Returns hand skeleton points and their positions. |
DetectHumanRectanglesRequest Detect humans in images. | HumanObservation Returns the bounding box positions of humans. |
DetectRectanglesRequest Detect rectangles in images. | RectangleObservation Returns the coordinates of the four vertices of rectangles. |
DetectTextRectanglesRequest Detect text regions in images. | TextObservation Returns the positions and bounding boxes of text regions. |
DetectTrajectoriesRequest Detect and analyze object motion trajectories. | TrajectoryObservation Returns motion trajectory points and their time series. |
GenerateAttentionBasedSaliencyImageRequest Generate attention-based saliency images. | SaliencyImageObservation Returns saliency maps of the most attractive areas in the image. |
GenerateForegroundInstanceMaskRequest Generate foreground instance mask images. | InstanceMaskObservation Returns masks of foreground objects. |
GenerateImageFeaturePrintRequest Generate image feature prints for comparison. | FeaturePrintObservation Returns feature fingerprint data of images for similarity comparison. |
GenerateObjectnessBasedSaliencyImageRequest Generate objectness-based saliency images. | SaliencyImageObservation Returns saliency maps of object saliency areas. |
GeneratePersonInstanceMaskRequest Generate person instance mask images. | InstanceMaskObservation Returns masks of person instances. |
GeneratePersonSegmentationRequest Generate person segmentation images. | SegmentationObservation Returns binary images of person segmentation. |
RecognizeAnimalsRequest Detect and identify animals in images. | RecognizedObjectObservation Returns animal types and their confidence levels. |
RecognizeTextRequest Detect and identify text in images. | RecognizedTextObservation Returns detected text content and its spatial positions. |
TrackHomographicImageRegistrationRequest Track homographic image registration. | ImageAlignmentObservation Returns homographic transformation matrices between images for image registration. |
TrackObjectRequest Track objects in images. | DetectedObjectObservation Returns the positions and velocity information of objects in images. |
TrackOpticalFlowRequest Track optical flow in images. | OpticalFlowObservation Returns optical flow vector fields describing pixel movements. |
TrackRectangleRequest Track rectangles in images. | RectangleObservation Returns the positions, sizes, and rotation angles of rectangles in images. |
TrackTranslationalImageRegistrationRequest Track translational image registration. | ImageAlignmentObservation Returns translational transformation matrices between images for image registration. |
- The same names prefixed with VN (e.g. VNDetectFaceRectanglesRequest) are the old, pre-iOS 18 API.
The speaker mentioned several commonly used Requests as follows.
ClassifyImageRequest
Recognize the input image, obtain label classification and confidence.
[Travelogue] 2024 Second Visit to Kyushu 9-Day Free and Easy Trip, Entering Fukuoka by Busan→Hakata Cruise
if #available(iOS 18.0, *) {
// New API using Swift features
let request = ClassifyImageRequest()
Task {
do {
let observations = try await request.perform(on: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*yL3vI1ADzwlovctW5WQgJw.jpeg")!)
observations.forEach {
observation in
print("\(observation.identifier): \(observation.confidence)")
}
}
catch {
print("Request failed: \(error)")
}
}
} else {
// Old method
let completionHandler: VNRequestCompletionHandler = {
request, error in
guard error == nil else {
print("Request failed: \(String(describing: error))")
return
}
guard let observations = request.results as? [VNClassificationObservation] else {
return
}
observations.forEach {
observation in
print("\(observation.identifier): \(observation.confidence)")
}
}
let request = VNClassifyImageRequest(completionHandler: completionHandler)
DispatchQueue.global().async {
let handler = VNImageRequestHandler(url: URL(string: "https://zhgchg.li/assets/cb65fd5ab770/1*3_jdrLurFuUfNdW4BJaRww.jpeg")!, options: [:])
do {
try handler.perform([request])
}
catch {
print("Request failed: \(error)")
}
}
}
Analysis Results:
• outdoor: 0.75392926
• sky: 0.75392926
• blue_sky: 0.7519531
• machine: 0.6958008
• cloudy: 0.26538086
• structure: 0.15728651
• sign: 0.14224191
• fence: 0.118652344
• banner: 0.0793457
• material: 0.075975396
• plant: 0.054406323
• foliage: 0.05029297
• light: 0.048126098
• lamppost: 0.048095703
• billboards: 0.040039062
• art: 0.03977703
• branch: 0.03930664
• decoration: 0.036868922
• flag: 0.036865234
....etc
RecognizeTextRequest
Recognize the text content in an image (a.k.a. OCR).
[[Travelogue] 2023 Tokyo 5-day free trip](../9da2c51fa4f2/)
if #available(iOS 18.0, *) {
// New API using Swift features
var request = RecognizeTextRequest()
request.recognitionLevel = .accurate
request.recognitionLanguages = [.init(identifier: "ja-JP"), .init(identifier: "en-US")] // Specify recognition language codes, e.g., Japanese and English
Task {
do {
let observations = try await request.perform(on: URL(string: "https://zhgchg.li/assets/9da2c51fa4f2/1*fBbNbDepYioQ-3-0XUkF6Q.jpeg")!)
observations.forEach {
observation in
let topCandidate = observation.topCandidates(1).first
print(topCandidate?.string ?? "No text recognized")
}
}
catch {
print("Request failed: \(error)")
}
}
} else {
// Old way
let completionHandler: VNRequestCompletionHandler = {
request, error in
guard error == nil else {
print("Request failed: \(String(describing: error))")
return
}
guard let observations = request.results as? [VNRecognizedTextObservation] else {
return
}
observations.forEach {
observation in
let topCandidate = observation.topCandidates(1).first
print(topCandidate?.string ?? "No text recognized")
}
}
let request = VNRecognizeTextRequest(completionHandler: completionHandler)
request.recognitionLevel = .accurate
request.recognitionLanguages = ["ja-JP", "en-US"] // Specify language code, e.g., Traditional Chinese
DispatchQueue.global().async {
let handler = VNImageRequestHandler(url: URL(string: "https://zhgchg.li/assets/9da2c51fa4f2/1*fBbNbDepYioQ-3-0XUkF6Q.jpeg")!, options: [:])
do {
try handler.perform([request])
}
catch {
print("Request failed: \(error)")
}
}
}
Analysis Result:
LE LABO Aoyama Store
TEL:03-6419-7167
*Thank you for your purchase*
No: 21347
Date: 2023/06/10 14.14.57
Responsible:
1690370
Register: 008A 1
Product Name
Tax-inclusive Price Quantity Tax-inclusive Total
Kaiak 10 EDP FB 15ML
J1P7010000S
16,800
16,800
Another 13 EDP FB 15ML
J1PJ010000S
10,700
10,700
Lip Balm 15ML
JOWC010000S
2,000
1
Total Amount
(Tax Included)
CARD
2,000
3 items purchased
29,500
0
29,500
29,500
DetectBarcodesRequest
Detect barcode and QR code data in the image.
Thai locals recommend Goose Brand Cooling Gel
let filePath = Bundle.main.path(forResource: "IMG_6777", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
// New API using Swift features
let request = DetectBarcodesRequest()
Task {
do {
let observations = try await request.perform(on: fileURL)
observations.forEach {
observation in
print("Payload: \(observation.payloadString ?? "No payload")")
print("Symbology: \(observation.symbology)")
}
}
catch {
print("Request failed: \(error)")
}
}
} else {
// Old way
let completionHandler: VNRequestCompletionHandler = {
request, error in
guard error == nil else {
print("Request failed: \(String(describing: error))")
return
}
guard let observations = request.results as? [VNBarcodeObservation] else {
return
}
observations.forEach {
observation in
print("Payload: \(observation.payloadStringValue ?? "No payload")")
print("Symbology: \(observation.symbology.rawValue)")
}
}
let request = VNDetectBarcodesRequest(completionHandler: completionHandler)
DispatchQueue.global().async {
let handler = VNImageRequestHandler(url: fileURL, options: [:])
do {
try handler.perform([request])
}
catch {
print("Request failed: \(error)")
}
}
}
Analysis Results:
Payload: 8859126000911
Symbology: VNBarcodeSymbologyEAN13
Payload: https://lin.ee/hGynbVM
Symbology: VNBarcodeSymbologyQR
Payload: http://www.hongthaipanich.com/
Symbology: VNBarcodeSymbologyQR
Payload: https://www.facebook.com/qr?id=100063856061714
Symbology: VNBarcodeSymbologyQR
RecognizeAnimalsRequest
Recognize animals in the image with confidence.
let filePath = Bundle.main.path(forResource: "IMG_5026", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
// New API using Swift features
let request = RecognizeAnimalsRequest()
Task {
do {
let observations = try await request.perform(on: fileURL)
observations.forEach {
observation in
let labels = observation.labels
labels.forEach {
label in
print("Detected animal: \(label.identifier) with confidence: \(label.confidence)")
}
}
}
catch {
print("Request failed: \(error)")
}
}
} else {
// Old way
let completionHandler: VNRequestCompletionHandler = {
request, error in
guard error == nil else {
print("Request failed: \(String(describing: error))")
return
}
guard let observations = request.results as? [VNRecognizedObjectObservation] else {
return
}
observations.forEach {
observation in
let labels = observation.labels
labels.forEach {
label in
print("Detected animal: \(label.identifier) with confidence: \(label.confidence)")
}
}
}
let request = VNRecognizeAnimalsRequest(completionHandler: completionHandler)
DispatchQueue.global().async {
let handler = VNImageRequestHandler(url: fileURL, options: [:])
do {
try handler.perform([request])
}
catch {
print("Request failed: \(error)")
}
}
}
Analysis Results:
Detected animal: Cat with confidence: 0.77245045
Others:
- Detecting human bodies in images: DetectHumanRectanglesRequest (see the sketch after this list)
- Detecting poses of animals and humans (3D or 2D): DetectAnimalBodyPoseRequest, DetectHumanBodyPose3DRequest, DetectHumanBodyPoseRequest, DetectHumanHandPoseRequest
- Detecting and tracking object trajectories (in different frames of videos, animations): DetectTrajectoriesRequest, TrackObjectRequest, TrackRectangleRequest
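For example, a minimal human-rectangle detection sketch with the pre-iOS 18 API might look like this (the image URL is whatever you pass in; nothing here is specific to the session):

```swift
import Vision

// Minimal sketch (pre-iOS 18 API): detect people in an image and print their bounding boxes.
func detectHumans(at url: URL) throws {
    let request = VNDetectHumanRectanglesRequest()
    let handler = VNImageRequestHandler(url: url, options: [:])
    try handler.perform([request])
    (request.results ?? []).forEach { observation in
        // boundingBox is normalized (0...1) with the origin at the lower-left corner.
        print("Human found at \(observation.boundingBox)")
    }
}
```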
iOS ≥ 18 Update Highlight:
VN*Request -> *Request (e.g. VNDetectBarcodesRequest -> DetectBarcodesRequest)
VN*Observation -> *Observation (e.g. VNRecognizedObjectObservation -> RecognizedObjectObservation)
VNRequestCompletionHandler -> async/await
VNImageRequestHandler.perform([VN*Request]) -> *Request.perform()
WWDC Example
The official WWDC video uses a supermarket product scanner as an example.
Most products have a Barcode that can be scanned
We can obtain the location of the barcode from observation.boundingBox. Note that unlike the familiar UIView coordinate system, the bounding box is normalized: values range from 0 to 1 and the origin is at the lower-left corner.
let filePath = Bundle.main.path(forResource: "IMG_6785", ofType: "png")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
// New API using Swift features
var request = DetectBarcodesRequest()
request.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
Task {
do {
let observations = try await request.perform(on: fileURL)
if let observation = observations.first {
DispatchQueue.main.async {
self.infoLabel.text = observation.payloadString
// Color layer marking
let colorLayer = CALayer()
// iOS >=18 new coordinate transformation API toImageCoordinates
// Not tested, may need to calculate the offset for ContentMode = AspectFit:
colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
self.baseImageView.layer.addSublayer(colorLayer)
}
print("BoundingBox: \(observation.boundingBox.cgRect)")
print("Payload: \(observation.payloadString ?? "No payload")")
print("Symbology: \(observation.symbology)")
}
}
catch {
print("Request failed: \(error)")
}
}
} else {
// Old approach
let completionHandler: VNRequestCompletionHandler = {
request, error in
guard error == nil else {
print("Request failed: \(String(describing: error))")
return
}
guard let observations = request.results as? [VNBarcodeObservation] else {
return
}
if let observation = observations.first {
DispatchQueue.main.async {
self.infoLabel.text = observation.payloadStringValue
// Color layer marking
let colorLayer = CALayer()
colorLayer.frame = self.convertBoundingBox(observation.boundingBox, to: self.baseImageView)
colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
self.baseImageView.layer.addSublayer(colorLayer)
}
print("BoundingBox: \(observation.boundingBox)")
print("Payload: \(observation.payloadStringValue ?? "No payload")")
print("Symbology: \(observation.symbology.rawValue)")
}
}
let request = VNDetectBarcodesRequest(completionHandler: completionHandler)
request.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
DispatchQueue.global().async {
let handler = VNImageRequestHandler(url: fileURL, options: [:])
do {
try handler.perform([request])
}
catch {
print("Request failed: \(error)")
}
}
}
iOS ≥ 18 Update Highlight:
// iOS ≥18 New Coordinate Transformation API toImageCoordinates
observation.boundingBox.toImageCoordinates(CGSize, origin: .upperLeft)
// https://developer.apple.com/documentation/vision/normalizedpoint/toimagecoordinates(from:imagesize:origin:)
Helper:
// Generated by ChatGPT 4o
// Since the photo in the ImageView is set with ContentMode = AspectFit
// Extra calculation is needed for the top and bottom offset caused by Fit
func convertBoundingBox(_ boundingBox: CGRect, to view: UIImageView) -> CGRect {
guard let image = view.image else {
return .zero
}
let imageSize = image.size
let viewSize = view.bounds.size
let imageRatio = imageSize.width / imageSize.height
let viewRatio = viewSize.width / viewSize.height
var scaleFactor: CGFloat
var offsetX: CGFloat = 0
var offsetY: CGFloat = 0
if imageRatio > viewRatio {
// Image fits in the width direction
scaleFactor = viewSize.width / imageSize.width
offsetY = (viewSize.height - imageSize.height * scaleFactor) / 2
}
else {
// Image fits in the height direction
scaleFactor = viewSize.height / imageSize.height
offsetX = (viewSize.width - imageSize.width * scaleFactor) / 2
}
let x = boundingBox.minX * imageSize.width * scaleFactor + offsetX
let y = (1 - boundingBox.maxY) * imageSize.height * scaleFactor + offsetY
let width = boundingBox.width * imageSize.width * scaleFactor
let height = boundingBox.height * imageSize.height * scaleFactor
return CGRect(x: x, y: y, width: width, height: height)
}
Output:
BoundingBox: (0.5295758928571429, 0.21408638121589782, 0.0943080357142857, 0.21254415360708087)
Payload: 4710018183805
Symbology: VNBarcodeSymbologyEAN13
Some products do not have a barcode, such as loose fruits with only product labels
Therefore, our scanner also needs to support scanning pure text labels simultaneously.
let filePath = Bundle.main.path(forResource: "apple", ofType: "jpg")! // Local test image
let fileURL = URL(filePath: filePath)
if #available(iOS 18.0, *) {
// New API using Swift features
var barcodesRequest = DetectBarcodesRequest()
barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
var textRequest = RecognizeTextRequest()
textRequest.recognitionLanguages = [.init(identifier: "zh-Hant"), .init(identifier: "en-US")]
Task {
do {
let handler = ImageRequestHandler(fileURL)
// Parameter pack syntax; all requests must finish before their results can be used.
// let (barcodesObservation, textObservation, ...) = try await handler.perform(barcodesRequest, textRequest, ...)
let (barcodesObservation, textObservation) = try await handler.perform(barcodesRequest, textRequest)
if let observation = barcodesObservation.first {
DispatchQueue.main.async {
self.infoLabel.text = observation.payloadString
// Color layer
let colorLayer = CALayer()
// New Coordinate Transformation API toImageCoordinates for iOS >=18
// Not tested, may need to consider the offset of ContentMode = AspectFit:
colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
self.baseImageView.layer.addSublayer(colorLayer)
}
print("BoundingBox: \(observation.boundingBox.cgRect)")
print("Payload: \(observation.payloadString ?? "No payload")")
print("Symbology: \(observation.symbology)")
}
textObservation.forEach {
observation in
let topCandidate = observation.topCandidates(1).first
print(topCandidate?.string ?? "No text recognized")
}
}
catch {
print("Request failed: \(error)")
}
}
} else {
// Old approach
let barcodesCompletionHandler: VNRequestCompletionHandler = {
request, error in
guard error == nil else {
print("Request failed: \(String(describing: error))")
return
}
guard let observations = request.results as? [VNBarcodeObservation] else {
return
}
if let observation = observations.first {
DispatchQueue.main.async {
self.infoLabel.text = observation.payloadStringValue
// Color layer
let colorLayer = CALayer()
colorLayer.frame = self.convertBoundingBox(observation.boundingBox, to: self.baseImageView)
colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
self.baseImageView.layer.addSublayer(colorLayer)
}
print("BoundingBox: \(observation.boundingBox)")
print("Payload: \(observation.payloadStringValue ?? "No payload")")
print("Symbology: \(observation.symbology.rawValue)")
}
}
let textCompletionHandler: VNRequestCompletionHandler = {
request, error in
guard error == nil else {
print("Request failed: \(String(describing: error))")
return
}
guard let observations = request.results as? [VNRecognizedTextObservation] else {
return
}
observations.forEach {
observation in
let topCandidate = observation.topCandidates(1).first
print(topCandidate?.string ?? "No text recognized")
}
}
let barcodesRequest = VNDetectBarcodesRequest(completionHandler: barcodesCompletionHandler)
barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcode is needed, it can be specified directly to improve performance
let textRequest = VNRecognizeTextRequest(completionHandler: textCompletionHandler)
textRequest.recognitionLevel = .accurate
textRequest.recognitionLanguages = ["en-US"]
DispatchQueue.global().async {
let handler = VNImageRequestHandler(url: fileURL, options: [:])
do {
try handler.perform([barcodesRequest, textRequest])
}
catch {
print("Request failed: \(error)")
}
}
}
Output:
94128s
ORGANIC
Pink Lady®
Produce of USh
iOS ≥ 18 Update Highlight:
let handler = ImageRequestHandler(fileURL)
// Parameter pack syntax; all requests must finish before their results can be used.
// let (barcodesObservation, textObservation, ...) = try await handler.perform(barcodesRequest, textRequest, ...)
let (barcodesObservation, textObservation) = try await handler.perform(barcodesRequest, textRequest)
iOS ≥ 18 performAll() method
The perform(barcodesRequest, textRequest) call above requires both the barcode scan and the text recognition to finish before execution continues. Starting with iOS 18, a new performAll() method is provided that streams results instead, letting you handle each result as soon as it arrives, e.g. responding immediately when a barcode is scanned.
if #available(iOS 18.0, *) {
// New API using Swift features
var barcodesRequest = DetectBarcodesRequest()
barcodesRequest.symbologies = [.ean13] // If only scanning EAN13 Barcodes is needed, it can be specified directly to improve performance
var textRequest = RecognizeTextRequest()
textRequest.recognitionLanguages = [.init(identifier: "zh-Hant"), .init(identifier: "en-US")]
Task {
let handler = ImageRequestHandler(fileURL)
let observation = handler.performAll([barcodesRequest, textRequest] as [any VisionRequest])
for try await result in observation {
switch result {
case .detectBarcodes(_, let barcodesObservation):
if let observation = barcodesObservation.first {
DispatchQueue.main.async {
self.infoLabel.text = observation.payloadString
// Color layer marking
let colorLayer = CALayer()
// iOS >=18 new coordinate transformation API toImageCoordinates
// Not tested, may still need to calculate the offset for ContentMode = AspectFit:
colorLayer.frame = observation.boundingBox.toImageCoordinates(self.baseImageView.frame.size, origin: .upperLeft)
colorLayer.backgroundColor = UIColor.red.withAlphaComponent(0.5).cgColor
self.baseImageView.layer.addSublayer(colorLayer)
}
print("BoundingBox: \(observation.boundingBox.cgRect)")
print("Payload: \(observation.payloadString ?? "No payload")")
print("Symbology: \(observation.symbology)")
}
case .recognizeText(_, let textObservation):
textObservation.forEach {
observation in
let topCandidate = observation.topCandidates(1).first
print(topCandidate?.string ?? "No text recognized")
}
default:
print("Unrecognized result: \(result)")
}
}
}
}
Optimize with Swift Concurrency
Suppose we have an image-wall list where each image needs its main subject automatically cropped out; this is where we can leverage Swift Concurrency to improve loading efficiency.
Original Implementation
func generateThumbnail(url: URL) async throws -> UIImage {
let request = GenerateAttentionBasedSaliencyImageRequest()
let saliencyObservation = try await request.perform(on: url)
return cropImage(url, to: saliencyObservation.salientObjects)
}
func generateAllThumbnails() async throws {
for image in images {
image.thumbnail = try await generateThumbnail(url: image.url)
}
}
Thumbnails are generated one at a time, which is slow and inefficient.
Optimization (1) — TaskGroup Concurrency
func generateAllThumbnails() async throws {
    try await withThrowingDiscardingTaskGroup { taskGroup in
        for image in images {
            // Add each thumbnail generation as a child task so they run concurrently
            taskGroup.addTask {
                image.thumbnail = try await generateThumbnail(url: image.url)
            }
        }
    }
}
Each thumbnail-generation task is added to the TaskGroup and executed concurrently.
Issue: image recognition and cropping are memory-intensive operations; unbounded parallelism can cause UI lag and OOM crashes.
Optimization (2) — TaskGroup Concurrency + Limiting Parallelism
func generateAllThumbnails() async throws {
    // A non-discarding task group is used here so completed child tasks can be awaited one by one
    try await withThrowingTaskGroup(of: Void.self) { taskGroup in
        // Run at most 5 tasks at a time
        let maxImageTasks = min(5, images.count)
        // Fill in the first 5 tasks
        for index in 0..<maxImageTasks {
            taskGroup.addTask {
                images[index].thumbnail = try await generateThumbnail(url: images[index].url)
            }
        }
        var nextIndex = maxImageTasks
        for try await _ in taskGroup {
            // Each time a task in the group finishes,
            // check whether the index has reached the end
            if nextIndex < images.count {
                let image = images[nextIndex]
                // Add the next task (keeping at most 5 in flight)
                taskGroup.addTask {
                    image.thumbnail = try await generateThumbnail(url: image.url)
                }
                nextIndex += 1
            }
        }
    }
}
Update an existing Vision app
- Vision will remove CPU and GPU support for some requests on devices with a Neural Engine. On those devices the Neural Engine is the best choice for performance; you can check with the supportedComputeDevices() API.
- Remove all VN prefixes: VNXXXRequest, VNXXXObservation -> XXXRequest, XXXObservation.
- Replace the original VNRequestCompletionHandler with async/await.
- Use *Request.perform() directly instead of VNImageRequestHandler.perform([VN*Request]).
Wrap-up
- API designed for Swift language features
- New features and methods are Swift Only, available for iOS ≥ 18
- New image scoring feature, body + hand movement tracking
Thanks!
KKday Business Recruitment
👉👉👉 This sharing comes from the weekly internal technical-sharing sessions of the KKday App Team. The team is currently recruiting a Senior iOS Engineer; interested friends are welcome to submit resumes. 👈👈👈
Reference
Discover Swift enhancements in the Vision framework
The Vision Framework API has been redesigned to leverage modern Swift features like concurrency, making it easier and faster to integrate a wide array of Vision algorithms into your app. We’ll tour the updated API and share sample code, along with best practices, to help you get the benefits of this framework with less coding effort. We’ll also demonstrate two new features: image aesthetics and holistic body pose.
Chapters
- 0:00 — Introduction
- 1:07 — New Vision API
- 1:47 — Get started with Vision
- 8:59 — Optimize with Swift Concurrency
- 11:05 — Update an existing Vision app
- 13:46 — What’s new in Vision?
Vision framework Apple Developer Documentation
Feel free to contact me for any questions or feedback.
This article was first published in Traditional Chinese on Medium ➡️ View Here