Artificial intelligence (AI) is a model or program capable of solving advanced tasks that would otherwise require human intelligence.
Machine learning explained
Machine learning (ML) algorithms are programs that learn to solve tasks from data, with no need for a human to explicitly program the machine to perform each specific task.
How does machine learning work?
Through continuous feedback loops, machine learning models are able to identify patterns and structure in data that they can then use to make inferences and take appropriate actions.
Neural networks explained
A neural network is a model inspired by the structure of the brain. It processes input to obtain an output by composing many simple mathematical functions. These functions roughly approximate how neurons are excited and inhibited, and how they in turn excite and inhibit surrounding neurons.
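As a minimal sketch of "composing many simple mathematical functions," the snippet below runs a forward pass through a tiny two-layer network. The sizes, weights, and activation are illustrative, not from any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Rough analogue of inhibition: negative inputs are zeroed out
    return np.maximum(0.0, x)

# Randomly initialized weights for a 3-input -> 4-hidden -> 2-output network
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

def forward(x):
    hidden = relu(x @ W1)   # first simple function: linear map + nonlinearity
    return hidden @ W2      # composed with a second linear map

output = forward(np.array([1.0, -0.5, 2.0]))
print(output.shape)  # (2,)
```

Training would then adjust `W1` and `W2` so the composed function maps inputs to desired outputs.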
Supervised and unsupervised learning
Supervised learning is a type of machine learning where the dataset is labeled. The algorithm learns how to match input data to the given label.
Unsupervised learning uses machine learning to cluster and analyze unlabeled datasets, enabling the discovery of hidden patterns without human intervention.
The main difference between these two types of learning is that supervised learning uses labeled input and output data, while unsupervised learning does not.
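To make the difference concrete, here is a toy sketch on one-dimensional data with two obvious groups: a supervised classifier that uses the labels, and a crude unsupervised k-means that never sees them. The data values are invented for illustration.

```python
import numpy as np

# Hypothetical 1-D data: two groups, around 0 and around 10.
data = np.array([0.1, 0.3, -0.2, 9.8, 10.1, 10.4])
labels = np.array([0, 0, 0, 1, 1, 1])  # only available in the supervised case

# Supervised: use the labels to compute class centroids, then classify new points.
centroids = np.array([data[labels == k].mean() for k in (0, 1)])

def classify(x):
    return int(np.argmin(np.abs(centroids - x)))

print(classify(9.0))  # 1 -- matched to the cluster learned from labels

# Unsupervised: no labels -- discover the two groups with a minimal k-means.
centers = np.array([data.min(), data.max()])
for _ in range(10):
    assign = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
    centers = np.array([data[assign == k].mean() for k in (0, 1)])
print(centers)  # recovers the two group means without ever seeing labels
```

Both approaches find the same structure here; the supervised version additionally knows what each group is called.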
Reinforcement learning (RL) is a type of machine learning where the agent learns to make sequential decisions that maximize a reward function. RL agents learn through trial and error and refine their decisions based on which past decisions have yielded higher or lower reward. Like a child or pet, the algorithm learns what behavior leads to positive or negative rewards based on the reward function it is trying to optimize.
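The trial-and-error loop described above can be sketched with an epsilon-greedy agent on a two-armed bandit. The payoff probabilities are an assumed "hidden reward function" for the demo; the agent learns which arm is better purely from the rewards it observes.

```python
import random

random.seed(42)
pay_prob = [0.3, 0.8]   # hidden reward function (assumed for this demo)
values = [0.0, 0.0]     # the agent's running estimate of each arm's reward
counts = [0, 0]
epsilon = 0.1           # fraction of the time the agent explores

for step in range(2000):
    if random.random() < epsilon:
        arm = random.randrange(2)         # explore: try a random action
    else:
        arm = values.index(max(values))   # exploit: take the best-known action
    reward = 1.0 if random.random() < pay_prob[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # running average

print(values)  # the estimates approach the true payoff probabilities
```

Over many trials the agent's value estimates converge toward the true payoffs, and it increasingly chooses the higher-reward arm.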
Sim to real
Sim to real refers to the transfer of a model learned in simulation or with synthetic data to a system in the real world or using real-world data.
Ground truth is the expected, error-free result. It is usually compared against the measurements or predictions of a system with noise or inherent errors to determine the accuracy of that system.
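A minimal sketch of that comparison: count how many of a model's predictions match the ground-truth labels. The labels and predictions here are invented for illustration.

```python
# Hypothetical ground-truth labels and a model's (imperfect) predictions
ground_truth = ["cat", "dog", "cat", "dog", "cat"]
predictions  = ["cat", "dog", "dog", "dog", "cat"]

correct = sum(gt == p for gt, p in zip(ground_truth, predictions))
accuracy = correct / len(ground_truth)
print(accuracy)  # 0.8 -- four of five predictions match the ground truth
```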
Computer vision revolves around perceiving the world visually in the same way a person does, leading to an understanding of the environment the system is operating in. In machine learning computer vision applications, cameras capture image data that is then labeled and annotated. Through training, a model learns to interpret those images and extrapolate the important aspects that were labeled, so it can visually detect objects and people and understand the environment.
Synthetic data is data created artificially rather than collected from real-world measurements or situations. Synthetic data not only cuts down the cost and time of data collection, but also offers ways to reduce bias, increase model performance, generate perfect labels, and diversify datasets.
Images courtesy of Neural Pocket: Structured synthetic data (left), unstructured synthetic data (right)
When creating synthetic data, the environment that provides the context for the computer vision problem may not necessarily resemble a real-world environment. A structured environment usually resembles the real-world environment being simulated, such as a building or home interior. An unstructured environment features a highly randomized background of unrelated images or objects with a high degree of variation.
Domain randomization is a synthetic data technique that helps build performant computer vision models by programmatically varying parameters in a dataset. In each frame, the specific objects, their position and orientation, the lighting and camera angles, and many other parameters can vary. This ensures a diverse dataset that can better train your model to handle variations in environmental conditions and edge cases.
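The per-frame randomization described above can be sketched as sampling scene parameters from specified ranges. The parameter names and ranges below are illustrative, not the API of any particular synthetic data tool.

```python
import random

random.seed(0)

def randomize_frame():
    # Sample a new set of scene parameters for each rendered frame
    return {
        "object_count":    random.randint(1, 10),
        "object_yaw_deg":  random.uniform(0.0, 360.0),
        "light_intensity": random.uniform(0.2, 2.0),
        "camera_height_m": random.uniform(1.0, 3.0),
        "background":      random.choice(["warehouse", "street", "random_texture"]),
    }

frames = [randomize_frame() for _ in range(3)]
for f in frames:
    print(f)
```

Rendering one image per sampled parameter set yields a dataset that varies object placement, lighting, and viewpoint, which helps the trained model generalize to environmental variation and edge cases.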
Digital images are constructed of pixels, each with red, green, and blue (RGB) values usually ranging from 0 to 255. The combination of these three values can represent over 16 million distinct colors and shades. RGB images are a common output of synthetic data generation.
Annotations in computer vision are anything accompanying an image to aid in the understanding of the image or objects and actions in the image. For example, bounding boxes may be considered annotations, as they are not a part of an image itself – they are present to help a computer vision model understand the image.
Image: 2D bounding boxes (top), 3D bounding boxes (bottom)
Bounding boxes are rectangular annotations placed around objects in images to identify or track those objects. There are two different types of bounding boxes:
- 2D bounding boxes precisely locate and label objects in screen space to help a computer vision model recognize them.
- 3D bounding boxes provide precise coordinates in the world space of object locations.
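A common way to compare a predicted 2D bounding box against a ground-truth box is intersection over union (IoU), sketched below for boxes given as (x_min, y_min, x_max, y_max) in pixel coordinates.

```python
def iou(box_a, box_b):
    # Intersection over union of two axis-aligned 2D bounding boxes
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes do not intersect)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.143
```

An IoU of 1.0 means the boxes coincide exactly; detection benchmarks often count a prediction as correct when its IoU with the ground-truth box exceeds a threshold such as 0.5.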
Image segmentation is a method for marking objects more precisely than bounding boxes, achieved by dividing a digital image into separate segments. Because the labels are applied on a per-pixel basis, image segmentation is more precise and is commonly used in computer vision and digital image processing to improve machine learning models. Image segmentation can take the form of semantic segmentation or instance segmentation.
Semantic segmentation, also known as class segmentation, provides a clear and precise mask to identify every instance of a class of objects in an image. For instance, all boxes are segmented in red in this image.
Instance segmentation separates and masks each labeled object uniquely. For example, all boxes are segmented in unique colors in this image.
Panoptic segmentation unifies semantic segmentation (assigning a class label to each pixel) and instance segmentation (detecting and segmenting each object instance). Panoptic segmentation tasks classify all the pixels in the image as belonging to a class label, yet also identify what instance of that class they belong to.
Panoptic segmentation is typically used for:
- Medical imagery, where instances as well as amorphous regions help shape the context.
- Self-driving cars and autonomous vehicles, where the system needs to know not only what objects are around the vehicle, but also what surface it is driving on.
- Digital image processing software that needs to have pixel-wise comprehension of the people in the image as well as what comprises the background.
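The three segmentation styles above can be sketched with per-pixel label arrays. The one-row "image" below contains two boxes of the same class separated by background; the values are invented for illustration.

```python
import numpy as np

# One-row "image" with two boxes (class 1) separated by background (class 0)
semantic = np.array([[1, 1, 0, 1, 1]])   # semantic: both boxes share class 1
instance = np.array([[1, 1, 0, 2, 2]])   # instance: each box gets a unique id
# Panoptic: pair each pixel's class label with its instance id
panoptic = np.stack([semantic, instance], axis=-1)

print(np.unique(semantic))  # [0 1]    -- the two boxes are indistinguishable
print(np.unique(instance))  # [0 1 2]  -- each box is a separate object
print(panoptic[0, 0], panoptic[0, 3])  # same class, different instance ids
```

Semantic labels alone cannot tell the two boxes apart; the instance ids can, and the panoptic pairing answers both "what class is this pixel?" and "which instance does it belong to?".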
Human keypoint labels
Human keypoint detection is a computer vision problem that involves simultaneously detecting people and localizing the keypoints (interest points). The keypoints describe certain landmarks on the human body, such as the location of the shoulders, wrists, hips, knees, etc. as viewed from the camera angle in the scene. Keypoints can semantically encapsulate the orientation or pose of the human body. By detecting the keypoints, the computer vision model can more easily recognize the pose, movements, and actions of humans. Keypoint labels can describe the 2D x,y image coordinates of the keypoints as viewed from the camera, or the 3D x,y,z spatial positions of the human keypoints in the scene with respect to the camera position or some other point of reference.
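As a small sketch of how 2D keypoint labels encode pose, the snippet below stores named (x, y) image coordinates and derives an elbow angle from the shoulder, elbow, and wrist positions. The coordinate values and joint names are illustrative.

```python
import math

# Hypothetical 2D keypoint labels: named (x, y) pixel coordinates
keypoints = {
    "shoulder": (100.0, 50.0),
    "elbow":    (100.0, 100.0),
    "wrist":    (150.0, 100.0),
}

def angle_at(b, a, c):
    # Angle at point b formed by the rays b->a and b->c, in degrees
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

elbow_angle = angle_at(keypoints["elbow"], keypoints["shoulder"], keypoints["wrist"])
print(elbow_angle)  # 90.0 -- the arm is bent at a right angle
```

Derived quantities like joint angles are one way a model built on keypoints can recognize poses and actions.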
For 3D pose estimation, a machine learning model estimates the position and orientation of an object or person from an image or video by estimating the spatial locations of keypoints. Pose estimation can aid in tracking how objects will move in real-world simulations and is used widely across areas such as augmented reality (AR), animation, gaming, and robotics.
Object detection describes computer vision tasks that involve identifying and detecting certain classes of objects. For instance, in this image, a computer vision system identified the object as a smartphone.
See how the AI startup Neural Pocket improves object detection with synthetic data.
Check out Unity’s AI and machine learning products and learn how they can help you solve diverse problems.