Recognize text in images with ML Kit on iOS

You can use ML Kit to recognize text in images or video, such as the text of a street sign. The main characteristics of this feature are:

Text Recognition v2 API
DescriptionRecognize text in images or videos, support for Latin, Chinese, Devanagari, Japanese and Korean scripts and a wide range of languages.
SDK namesGoogleMLKit/TextRecognition
GoogleMLKit/TextRecognitionChinese
GoogleMLKit/TextRecognitionDevanagari
GoogleMLKit/TextRecognitionJapanese
GoogleMLKit/TextRecognitionKorean
ImplementationAssets are statically linked to your app at build time
App size impactAbout 38 MB per script SDK
PerformanceReal-time on most devices for Latin script SDK, slower for others.

Try it out

  • Play around with the sample app to see an example usage of this API.
  • Try the code yourself with the codelab.

Before you begin

  1. Include the following ML Kit pods in your Podfile:
    # To recognize Latin script
    pod 'GoogleMLKit/TextRecognition', '3.2.0'
    # To recognize Chinese script
    pod 'GoogleMLKit/TextRecognitionChinese', '3.2.0'
    # To recognize Devanagari script
    pod 'GoogleMLKit/TextRecognitionDevanagari', '3.2.0'
    # To recognize Japanese script
    pod 'GoogleMLKit/TextRecognitionJapanese', '3.2.0'
    # To recognize Korean script
    pod 'GoogleMLKit/TextRecognitionKorean', '3.2.0'
    
  2. After you install or update your project's Pods, open your Xcode project using its .xcworkspace. ML Kit is supported in Xcode version 12.4 or greater.

1. Create an instance of TextRecognizer

Create an instance of TextRecognizer by calling +textRecognizer(options:), passing the options related to the SDK you declared as dependency on above:

Swift

// When using Latin script recognition SDK
let latinOptions = TextRecognizerOptions()
let latinTextRecognizer = TextRecognizer.textRecognizer(options:options)

// When using Chinese script recognition SDK
let chineseOptions = ChineseTextRecognizerOptions()
let chineseTextRecognizer = TextRecognizer.textRecognizer(options:options)

// When using Devanagari script recognition SDK
let devanagariOptions = DevanagariTextRecognizerOptions()
let devanagariTextRecognizer = TextRecognizer.textRecognizer(options:options)

// When using Japanese script recognition SDK
let japaneseOptions = JapaneseTextRecognizerOptions()
let japaneseTextRecognizer = TextRecognizer.textRecognizer(options:options)

// When using Korean script recognition SDK
let koreanOptions = KoreanTextRecognizerOptions()
let koreanTextRecognizer = TextRecognizer.textRecognizer(options:options)

Objective-C

// When using Latin script recognition SDK
MLKTextRecognizerOptions *latinOptions = [[MLKTextRecognizerOptions alloc] init];
MLKTextRecognizer *latinTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:options];

// When using Chinese script recognition SDK
MLKChineseTextRecognizerOptions *chineseOptions = [[MLKChineseTextRecognizerOptions alloc] init];
MLKTextRecognizer *chineseTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:options];

// When using Devanagari script recognition SDK
MLKDevanagariTextRecognizerOptions *devanagariOptions = [[MLKDevanagariTextRecognizerOptions alloc] init];
MLKTextRecognizer *devanagariTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:options];

// When using Japanese script recognition SDK
MLKJapaneseTextRecognizerOptions *japaneseOptions = [[MLKJapaneseTextRecognizerOptions alloc] init];
MLKTextRecognizer *japaneseTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:options];

// When using Korean script recognition SDK
MLKKoreanTextRecognizerOptions *koreanOptions = [[MLKKoreanTextRecognizerOptions alloc] init];
MLKTextRecognizer *koreanTextRecognizer = [MLKTextRecognizer textRecognizerWithOptions:options];

2. Prepare the input image

Pass the image as a UIImage or a CMSampleBufferRef to the TextRecognizer's process(_:completion:) method:

Create a VisionImage object using a UIImage or a CMSampleBuffer.

If you use a UIImage, follow these steps:

  • Create a VisionImage object with the UIImage. Make sure to specify the correct .orientation.

    Swift

    let image = VisionImage(image: UIImage)
    visionImage.orientation = image.imageOrientation

    Objective-C

    MLKVisionImage *visionImage = [[MLKVisionImage alloc] initWithImage:image];
    visionImage.orientation = image.imageOrientation;

If you use a CMSampleBuffer, follow these steps:

  • Specify the orientation of the image data contained in the CMSampleBuffer.

    To get the image orientation:

    Swift

    func imageOrientation(
      deviceOrientation: UIDeviceOrientation,
      cameraPosition: AVCaptureDevice.Position
    ) -> UIImage.Orientation {
      switch deviceOrientation {
      case .portrait:
        return cameraPosition == .front ? .leftMirrored : .right
      case .landscapeLeft:
        return cameraPosition == .front ? .downMirrored : .up
      case .portraitUpsideDown:
        return cameraPosition == .front ? .rightMirrored : .left
      case .landscapeRight:
        return cameraPosition == .front ? .upMirrored : .down
      case .faceDown, .faceUp, .unknown:
        return .up
      }
    }
          

    Objective-C

    - (UIImageOrientation)
      imageOrientationFromDeviceOrientation:(UIDeviceOrientation)deviceOrientation
                             cameraPosition:(AVCaptureDevicePosition)cameraPosition {
      switch (deviceOrientation) {
        case UIDeviceOrientationPortrait:
          return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationLeftMirrored
                                                                : UIImageOrientationRight;
    
        case UIDeviceOrientationLandscapeLeft:
          return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationDownMirrored
                                                                : UIImageOrientationUp;
        case UIDeviceOrientationPortraitUpsideDown:
          return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationRightMirrored
                                                                : UIImageOrientationLeft;
        case UIDeviceOrientationLandscapeRight:
          return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationUpMirrored
                                                                : UIImageOrientationDown;
        case UIDeviceOrientationUnknown:
        case UIDeviceOrientationFaceUp:
        case UIDeviceOrientationFaceDown:
          return UIImageOrientationUp;
      }
    }
          
  • Create a VisionImage object using the CMSampleBuffer object and orientation:

    Swift

    let image = VisionImage(buffer: sampleBuffer)
    image.orientation = imageOrientation(
      deviceOrientation: UIDevice.current.orientation,
      cameraPosition: cameraPosition)

    Objective-C

     MLKVisionImage *image = [[MLKVisionImage alloc] initWithBuffer:sampleBuffer];
     image.orientation =
       [self imageOrientationFromDeviceOrientation:UIDevice.currentDevice.orientation
                                    cameraPosition:cameraPosition];

3. Process the image

Then, pass the image to the process(_:completion:) method:

Swift

textRecognizer.process(visionImage) { result, error in
  guard error == nil, let result = result else {
    // Error handling
    return
  }
  // Recognized text
}

Objective-C

[textRecognizer processImage:image
                  completion:^(MLKText *_Nullable result,
                               NSError *_Nullable error) {
  if (error != nil || result == nil) {
    // Error handling
    return;
  }
  // Recognized text
}];

4. Extract text from blocks of recognized text

If the text recognition operation succeeds, it returns a Text object. A Text object contains the full text recognized in the image and zero or more TextBlock objects.

Each TextBlock represents a rectangular block of text, which contain zero or more TextLine objects. Each TextLine object contains zero or more TextElement objects, which represent words and word-like entities such as dates and numbers.

For each TextBlock, TextLine, and TextElement object, you can get the text recognized in the region and the bounding coordinates of the region.

For example:

Swift

let resultText = result.text
for block in result.blocks {
    let blockText = block.text
    let blockLanguages = block.recognizedLanguages
    let blockCornerPoints = block.cornerPoints
    let blockFrame = block.frame
    for line in block.lines {
        let lineText = line.text
        let lineLanguages = line.recognizedLanguages
        let lineCornerPoints = line.cornerPoints
        let lineFrame = line.frame
        for element in line.elements {
            let elementText = element.text
            let elementCornerPoints = element.cornerPoints
            let elementFrame = element.frame
        }
    }
}

Objective-C

NSString *resultText = result.text;
for (MLKTextBlock *block in result.blocks) {
  NSString *blockText = block.text;
  NSArray<MLKTextRecognizedLanguage *> *blockLanguages = block.recognizedLanguages;
  NSArray<NSValue *> *blockCornerPoints = block.cornerPoints;
  CGRect blockFrame = block.frame;
  for (MLKTextLine *line in block.lines) {
    NSString *lineText = line.text;
    NSArray<MLKTextRecognizedLanguage *> *lineLanguages = line.recognizedLanguages;
    NSArray<NSValue *> *lineCornerPoints = line.cornerPoints;
    CGRect lineFrame = line.frame;
    for (MLKTextElement *element in line.elements) {
      NSString *elementText = element.text;
      NSArray<NSValue *> *elementCornerPoints = element.cornerPoints;
      CGRect elementFrame = element.frame;
    }
  }
}

Input image guidelines

  • For ML Kit to accurately recognize text, input images must contain text that is represented by sufficient pixel data. Ideally, each character should be at least 16x16 pixels. There is generally no accuracy benefit for characters to be larger than 24x24 pixels.

    So, for example, a 640x480 image might work well to scan a business card that occupies the full width of the image. To scan a document printed on letter-sized paper, a 720x1280 pixel image might be required.

  • Poor image focus can affect text recognition accuracy. If you aren't getting acceptable results, try asking the user to recapture the image.

  • If you are recognizing text in a real-time application, you should consider the overall dimensions of the input images. Smaller images can be processed faster. To reduce latency, ensure that the text occupies as much of the image as possible, and capture images at lower resolutions (keeping in mind the accuracy requirements mentioned above). For more information, see Tips to improve performance.

Tips to improve performance

  • For processing video frames, use the results(in:) synchronous API of the detector. Call this method from the AVCaptureVideoDataOutputSampleBufferDelegate's captureOutput(_, didOutput:from:) function to synchronously get results from the given video frame. Keep AVCaptureVideoDataOutput's alwaysDiscardsLateVideoFrames as true to throttle calls to the detector. If a new video frame becomes available while the detector is running, it will be dropped.
  • If you use the output of the detector to overlay graphics on the input image, first get the result from ML Kit, then render the image and overlay in a single step. By doing so, you render to the display surface only once for each processed input frame. See the updatePreviewOverlayViewWithLastFrame in the ML Kit quickstart sample for an example.
  • Consider capturing images at a lower resolution. However, also keep in mind this API's image dimension requirements.
  • To avoid potential performance degradation, do not run multiple TextRecognizer instances with different script options concurrently.