Sorting apples by color sounds simple enough. In fact, it is the kind of project that often produces an impressive demo in a matter of hours. Put a camera above a conveyor, detect red apples, move a robot arm, and you have something that works.
At least until the real world shows up.
Lighting changes. Fruit overlaps. Apples touch each other. Surfaces reflect light differently. A bruise that is invisible today may show up days later after the product has already been sorted and shipped. That is where the engineering effort begins.
In this article, we will walk through the complete robotic color-sorting pipeline, from image acquisition to motion control, and focus on the challenges that appear when moving from a successful proof of concept to a production-ready system.
At a high level, the process looks straightforward:
A camera captures an image of each apple under controlled lighting. Vision software separates the fruit from the background and measures its color. A classifier determines which destination bin it belongs in. A motion planner generates a path to the target. Finally, the gripper applies enough force to securely pick the apple without damaging it.
The important detail is that none of these stages operates independently. Every step consumes part of the cycle-time budget. When the stages run back to back, their latencies add up; when they are pipelined, the slowest stage caps the throughput of the entire cell.
The loop
Whether you are sorting apples, inspecting manufactured parts, or separating recyclable materials, most robotic sorting systems follow the same fundamental pattern:
SENSE: segment and measure the apple → PLAN: classify it and select a grasp strategy → ACT: execute the motion under force control → repeat
The system builds a model of the environment, decides what action to take, and executes that action before the opportunity disappears.
Breaking the system into these three stages is useful because each stage tends to fail in different ways.
The sensing stage is limited by optics and lighting. If the image quality is poor, no amount of downstream processing can fully recover the missing information.
The planning stage fails when classifications become ambiguous, grasp candidates collide with neighboring objects, or the system encounters conditions that were never accounted for during design.
The act stage fails when motion plans approach singularities, commands arrive too late, or contact forces exceed what the product can tolerate.
Of course, these stages are not truly independent. They all compete for the same timing budget. Image exposure, data transfer, inference, trajectory generation, and arm motion all need to fit within the time available for a single pick.
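To make the shared budget concrete, here is a minimal sketch that sums per-stage latencies for a strictly sequential cycle and compares the total with the time available per pick. Every number and name in it (`STAGE_MS`, `budget_report`, the target of 40 picks per minute) is a made-up assumption for illustration, not a measurement from any real cell.

```python
# Hypothetical per-stage latencies in milliseconds (illustrative values only).
STAGE_MS = {
    "exposure": 5.0,
    "transfer": 8.0,
    "inference": 20.0,
    "trajectory_generation": 12.0,
    "arm_motion": 900.0,
}

def budget_report(stage_ms, picks_per_minute):
    """Return (budget_ms, used_ms, slack_ms) for a strictly sequential pick cycle."""
    budget_ms = 60_000.0 / picks_per_minute
    used_ms = sum(stage_ms.values())
    return budget_ms, used_ms, budget_ms - used_ms

budget, used, slack = budget_report(STAGE_MS, picks_per_minute=40)
print(f"budget {budget:.0f} ms, used {used:.0f} ms, slack {slack:.0f} ms")
# prints: budget 1500 ms, used 945 ms, slack 555 ms
```

With numbers like these, arm motion dominates the cycle, so shaving a few milliseconds off inference would matter far less than shortening the path.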
The simulation below demonstrates this loop using a Fanuc CR-7iA/L six-axis collaborative robot sorting red, green, and yellow apples.
A collaborative robot is a useful example because it makes the constraints of the act stage easier to see. Unlike high-speed industrial robots that rely on guarding and separation, collaborative robots trade top-end speed for safe operation around people. As a result, force limits and compliance considerations become highly visible design constraints that influence every pick.
Sense: the perception pipeline
Most color-sorting problems are won or lost before the first line of vision code runs.
The camera, optics, and lighting setup determine the quality of the data entering the system. If those fundamentals are not controlled, the software ends up spending its time compensating for variability that should never have existed in the first place.
That is why production systems typically lock exposure and white balance rather than leaving them on automatic settings. A vision algorithm cannot be validated against a sensor that is continuously re-tuning itself between frames. Consistency matters more than adaptability.
Lighting follows the same principle. The goal is to eliminate unknowns by controlling the illumination environment. Most systems use diffuse lighting, often a dome light or LED ring, so the only light reaching the product is the light the designer intended.
This becomes especially important when dealing with fruit. Waxy surfaces create specular reflections from point light sources, and those reflections can distort color measurements. A blown-out highlight may be interpreted as the wrong hue or provide no useful color information at all.
Segmenting the fruit
Once a clean image has been captured, the next step is segmentation.
In a controlled environment, segmentation can be surprisingly simple. If the floor or conveyor is a color that no apple will ever exhibit, each pixel can be compared against the known background color. Connected-component labeling then groups the remaining foreground pixels into individual objects.
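A minimal sketch of that two-step idea, using only the Python standard library on a tiny hand-made image. The conveyor color, the tolerance, and the function names are assumptions chosen for illustration; a production system would normally use an optimized vision library rather than nested lists.

```python
from collections import deque

def foreground_mask(image, background, tol):
    """Mark pixels whose RGB distance from the known background exceeds tol."""
    def far(pixel):
        return sum((a - b) ** 2 for a, b in zip(pixel, background)) > tol ** 2
    return [[far(pixel) for pixel in row] for row in image]

def label_components(mask):
    """4-connected component labeling via breadth-first flood fill."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

BLUE = (0, 0, 255)      # assumed conveyor color that no apple exhibits
RED = (200, 30, 30)
GREEN = (60, 180, 60)
image = [
    [BLUE, RED,  RED,  BLUE, BLUE],
    [BLUE, RED,  RED,  BLUE, GREEN],
    [BLUE, BLUE, BLUE, BLUE, GREEN],
]
labels, count = label_components(foreground_mask(image, BLUE, tol=60))
print(count)  # 2
```

The two blobs are separated by a column of background pixels, so they receive different labels. If they shared an edge, this approach would merge them into one region.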
The complication is one every vision engineer eventually encounters: objects touching each other.
Two apples resting against one another often appear as a single connected region. At that point, simple foreground extraction is no longer enough.
A common solution is a watershed transform, typically seeded from the peaks of a distance transform, which splits the merged region along the narrow neck where the two apples meet. When color information is not sufficient, depth data can help. A depth camera can distinguish between neighboring apples with identical hues because they still appear as separate physical mounds in range data.
Measuring color
Only after segmentation is complete does color measurement begin.
And despite what many first implementations assume, that measurement is rarely performed directly in RGB space.
Instead, pixels are typically converted to HSV. Hue remains relatively stable across lighting intensity changes, making it a much better feature for classification. A red apple in shadow and the same apple under a spotlight may produce very different RGB values, but their hue values will remain close together.
Before calculating a final color measurement, pixels with extremely high value (the V channel, or brightness) or very low saturation are removed. This masks out glare, reflections, and shadow boundaries that would otherwise skew the result.
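The conversion and masking steps can be sketched with Python's standard `colorsys` module. The thresholds (`max_value=0.95`, `min_saturation=0.20`) and the sample pixels are assumptions picked for illustration; real limits would be tuned against the actual camera and lighting.

```python
import colorsys

def hue_sat_val(rgb):
    """Convert an 8-bit RGB triple to (hue in degrees, saturation, value)."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360.0, s, v

def usable_hues(pixels, max_value=0.95, min_saturation=0.20):
    """Keep hue only for pixels that are neither blown out nor washed out."""
    hues = []
    for rgb in pixels:
        h, s, v = hue_sat_val(rgb)
        if v <= max_value and s >= min_saturation:
            hues.append(h)
    return hues

pixels = [
    (200, 50, 40),    # lit red skin
    (100, 25, 20),    # the same red at half the brightness
    (255, 250, 250),  # specular highlight: value too high
    (190, 180, 180),  # washed-out pixel: saturation too low
]
lit_hue = hue_sat_val(pixels[0])[0]
shadow_hue = hue_sat_val(pixels[1])[0]
kept = usable_hues(pixels)
print(round(lit_hue, 2), round(shadow_hue, 2), len(kept))
```

The first two pixels have very different RGB values but the same hue of about 3.75 degrees, and only those two survive the mask.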
The remaining pixels are then reduced to a single representative hue.
Because hue wraps around a circular scale, standard averaging techniques can produce incorrect results. Red, for example, sits at the point where the hue wheel wraps through zero degrees. For that reason, color reduction relies on circular statistics.
In many systems, a circular median is preferred because it is more resistant to glare and outliers. Where cycle time is critical, the cheaper circular mean may be used instead.
Either way, the measurement is typically based on thousands of pixels, ensuring that a single bright reflection does not disproportionately influence the final classification.
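A small sketch shows why ordinary averaging breaks down for red and how a circular mean avoids the problem. The function name and sample hues are illustrative assumptions. The second return value, the resultant length, is one possible way to express how tightly the hues cluster; treating it as the confidence indicator is an assumption, not something the pipeline above prescribes.

```python
import math

def circular_mean_deg(hues_deg):
    """Return (mean direction in degrees, resultant length in [0, 1])."""
    n = len(hues_deg)
    x = sum(math.cos(math.radians(h)) for h in hues_deg) / n
    y = sum(math.sin(math.radians(h)) for h in hues_deg) / n
    mean = math.degrees(math.atan2(y, x)) % 360.0
    return mean, math.hypot(x, y)

reds = [350.0, 355.0, 5.0, 10.0]        # all visibly red, straddling 0 degrees
naive = sum(reds) / len(reds)           # 180.0, which is cyan
mean, concentration = circular_mean_deg(reds)
# mean lands at approximately 0 degrees (red); concentration is close to 1
```

The arithmetic mean of these four red hues is 180 degrees, the opposite side of the wheel. The circular mean treats each hue as a unit vector, averages the vectors, and reads the angle back, so it lands on red.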
By the time the sensing stage is complete, the system is no longer working with images. It is working with measurements.
What gets passed to the next stage is a compact and defensible description of the scene: the position of each apple, its measured hue, and an indication of how confidently that hue was resolved.
Plan: classification and the intercept problem
Once you have measured color reliably, classification is often the easy part.
For well-separated colors, a simple threshold-based classifier is difficult to beat. Red apples might fall below roughly 25 degrees or above 340 degrees on the hue wheel. Yellow might occupy the 40-to-70-degree range, while green falls between 75 and 150 degrees.
The important detail is that those ranges do not cover the entire hue spectrum. Production systems need a defined response for colors that land between the expected bands rather than automatically forcing them into the nearest category.
Threshold-based classifiers may seem unsophisticated compared to machine learning models, but they are often the right engineering choice when the problem allows it. They are deterministic, easy to explain, and can be verified without collecting a dataset. Most importantly, their behavior on previously unseen inputs is fully defined.
That is a property many learned models cannot guarantee.
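A threshold classifier along these lines fits in a few lines of code. The sketch below uses the example bands quoted above; the `"reject"` label is a hypothetical name for whatever defined response a given system chooses for hues outside every band.

```python
def classify_hue(hue_deg):
    """Map a hue in degrees to a bin using the example bands from the text."""
    h = hue_deg % 360.0
    if h < 25.0 or h > 340.0:
        return "red"
    if 40.0 <= h <= 70.0:
        return "yellow"
    if 75.0 <= h <= 150.0:
        return "green"
    return "reject"  # hypothetical bin for hues that fall outside every band

for hue in (10.0, 350.0, 55.0, 100.0, 30.0, 200.0):
    print(hue, classify_hue(hue))
```

Because every input maps to exactly one branch, the behavior can be checked exhaustively by sweeping the hue range, with no dataset required.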
When machine learning makes sense
A learned classifier becomes valuable when the decision boundary is no longer a simple threshold in hue space.
Tasks like ripeness grading, bruise detection, or distinguishing between two cultivars that are both predominantly red are difficult, or impossible, to reduce to a hand-written rule. In those cases, a machine learning model can capture relationships that would be impractical to encode manually.
That added capability comes with real engineering costs.
The training dataset must represent every lighting condition, product variation, and seasonal change the system will encounter. Performance on out-of-distribution inputs is no longer guaranteed. Validation becomes a statistical exercise rather than a specification-based one.
The goal is not to deploy the most sophisticated classifier available. The goal is to deploy the simplest classifier that satisfies the requirements.
That is often the more important engineering decision.
Choosing what to pick
Classification determines where an apple should go. Planning determines how to get it there.
With apples resting on a floor or conveyor, the planner needs to answer two questions:
- Which apple should be picked next?
- What grasp can be executed safely and reliably?
That means evaluating approach direction, wrist orientation, and gripper opening width while ensuring the selected grasp does not interfere with neighboring fruit.
The planner must also respect the robot’s physical constraints. Targets that require awkward elbow flips, or that lie near the edge of the robot’s reachable workspace, are typically discarded in favor of more accessible candidates.
Pick ordering matters as well. A planner that constantly sends the arm across the entire workspace may satisfy the task requirements while delivering poor throughput. Efficient systems choose pick sequences that minimize unnecessary motion and keep cycle times predictable.
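One simple heuristic for pick ordering is greedy nearest-neighbor selection: after each placement, choose the apple closest to where the arm currently is. The sketch below is an illustration under strong assumptions (flat 2D coordinates, straight-line distance as the cost, hypothetical bin positions), not a description of any particular planner, and greedy ordering is not guaranteed to be optimal.

```python
import math

def greedy_pick_order(home, apples):
    """Order picks by nearest apple to the arm's current position.

    apples is a list of (apple_xy, bin_xy) pairs. After each pick the arm
    is assumed to end up at that apple's destination bin.
    """
    remaining = list(apples)
    position, order = home, []
    while remaining:
        nxt = min(remaining, key=lambda a: math.dist(position, a[0]))
        remaining.remove(nxt)
        order.append(nxt)
        position = nxt[1]
    return order

RED_BIN = (1.0, 0.0)     # hypothetical bin locations in meters
GREEN_BIN = (0.0, 1.0)
apples = [
    ((0.5, 0.5), RED_BIN),
    ((0.1, 0.1), GREEN_BIN),
    ((0.9, 0.2), RED_BIN),
]
order = greedy_pick_order(home=(0.0, 0.0), apples=apples)
print([apple for apple, _ in order])
# [(0.1, 0.1), (0.5, 0.5), (0.9, 0.2)]
```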
Handling ambiguous cases
The most interesting planning problems usually are not the easy ones.
Consider a half-red, half-green apple.
If the perception pipeline collapses the entire surface into a single hue value, the result becomes misleading. The hue distribution is bimodal, not centered around a single color. In fact, averaging red and green often produces a value that falls into the yellow range, a classification that does not describe the fruit at all.
This is a good example of why understanding the underlying data matters.
In the simulation, bicolor apples are assigned their own category rather than being forced into either the red or green bin. The broader lesson applies well beyond fruit sorting:
Model the classes that actually exist in the real world and make sure every one of them has a defined destination.
Systems tend to fail when engineers assume the world will conform to the categories they have created. Reliable systems are built by recognizing the categories the world already contains.
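Returning to the half-red, half-green apple: the circular mean of hues at 0 and 120 degrees is 60 degrees, squarely inside the yellow band. One way to avoid that trap is to classify each pixel separately and look at the vote shares. This is a sketch of that idea, not the method used in the simulation; the band limits repeat the earlier example values, and `min_share=0.30` is an arbitrary illustrative threshold.

```python
from collections import Counter

def band(hue_deg):
    """Per-pixel label using the illustrative hue bands from the text."""
    h = hue_deg % 360.0
    if h < 25.0 or h > 340.0:
        return "red"
    if 40.0 <= h <= 70.0:
        return "yellow"
    if 75.0 <= h <= 150.0:
        return "green"
    return None

def classify_by_vote(hues_deg, min_share=0.30):
    """Vote per pixel; report 'bicolor' when two classes each hold a large share."""
    votes = Counter(b for b in map(band, hues_deg) if b is not None)
    total = sum(votes.values())
    if total == 0:
        return "reject"
    strong = [name for name, n in votes.most_common() if n / total >= min_share]
    if len(strong) >= 2:
        return "bicolor"
    return strong[0] if strong else "reject"

half_and_half = [5.0] * 50 + [110.0] * 50
mostly_red = [5.0] * 90 + [110.0] * 10
print(classify_by_vote(half_and_half), classify_by_vote(mostly_red))
# bicolor red
```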
Act: kinematics, trajectories, and contact
By the time the planning stage is complete, the robot knows two things: where it needs to pick the apple and where it needs to place it.
The challenge now is turning those goals into motion.
That sounds straightforward until you remember that robot controllers do not execute poses. They execute joint commands.
From pose to joint angles
The first step is inverse kinematics: converting a desired end-effector pose into a set of joint angles the robot can actually achieve.
For a simple two-link planar arm, inverse kinematics is a relatively clean geometry problem. Given the link lengths and target position, the elbow angle can be computed using the law of cosines. The shoulder angle follows from the bearing to the target, adjusted by a correction term that accounts for the elbow bend. The primary choice is whether to use the elbow-up or elbow-down solution.
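The two-link case is compact enough to write out. In this sketch the elbow angle is measured relative to the first link, and which sign counts as "elbow up" is a convention chosen here for illustration. The forward-kinematics helper exists only to check the result.

```python
import math

def two_link_ik(x, y, l1, l2, elbow_up=True):
    """Return (shoulder, elbow) in radians for a planar 2R arm, or None if unreachable."""
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)  # law of cosines
    if abs(cos_elbow) > 1.0:
        return None  # target lies outside the reachable annulus
    elbow = math.acos(cos_elbow)
    if elbow_up:
        elbow = -elbow  # sign convention assumed for this sketch
    # Bearing to the target minus the correction for the bent elbow.
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow

def two_link_fk(shoulder, elbow, l1, l2):
    """Forward kinematics, used here to verify the IK solution."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y

L1, L2 = 0.4, 0.3   # hypothetical link lengths in meters
for up in (True, False):
    solution = two_link_ik(0.5, 0.3, L1, L2, elbow_up=up)
    print(up, solution, two_link_fk(*solution, L1, L2))
print(two_link_ik(1.0, 0.0, L1, L2))  # None: beyond full extension
```

Both branches reach the same point with different joint angles, which is the elbow-up versus elbow-down choice in its simplest form.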
A six-axis robot is a different story.
For most target poses, multiple valid joint configurations exist. The planner must decide which one to use while considering several competing constraints:
- Continuity with the current robot configuration
- Distance from joint limits
- Clearance from obstacles
- Avoidance of singularities
Singularities deserve special attention because they can create serious control problems. At a singular configuration, the robot’s Jacobian loses rank, and near one, even a slow Cartesian motion may require extremely large joint velocities. A path that looks perfectly reasonable in task space can become impractical or impossible to execute safely.
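The effect is easiest to see on a planar two-link arm rather than a full six-axis robot, which is the simplification assumed in this sketch. For that arm the Jacobian determinant is l1 · l2 · sin(elbow), so it vanishes when the arm is fully stretched or folded. The link lengths and the requested velocity are illustrative values.

```python
import math

def jacobian_2r(t1, t2, l1, l2):
    """2x2 Jacobian of a planar two-link arm (rows: x and y velocity)."""
    s1, c1 = math.sin(t1), math.cos(t1)
    s12, c12 = math.sin(t1 + t2), math.cos(t1 + t2)
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))

def joint_rates(t1, t2, l1, l2, vx, vy):
    """Solve J * qdot = v; returns None at an exact singularity."""
    (a, b), (c, d) = jacobian_2r(t1, t2, l1, l2)
    det = a * d - b * c  # equals l1 * l2 * sin(t2)
    if det == 0.0:
        return None
    return ((d * vx - b * vy) / det, (-c * vx + a * vy) / det)

def peak_joint_rate(elbow_deg, l1=0.4, l2=0.3, vx=0.1):
    """Largest joint speed needed for a 0.1 m/s move along x."""
    rates = joint_rates(0.0, math.radians(elbow_deg), l1, l2, vx, 0.0)
    return max(abs(r) for r in rates)

far_from_singular = peak_joint_rate(90.0)  # about 0.33 rad/s
near_singular = peak_joint_rate(1.0)       # about 33 rad/s
print(far_from_singular, near_singular)
```

The same 0.1 m/s Cartesian request costs roughly a hundred times more joint speed when the elbow is one degree from straight.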
Trajectory generation
Even after inverse kinematics produces a valid set of joint angles, the robot still is not ready to move.
Joint positions are targets, not commands.
The controller must generate a trajectory between the current state and the desired state while respecting velocity and acceleration constraints.
One common approach is a trapezoidal velocity profile, which caps velocity and acceleration but lets acceleration change instantaneously at each corner. Another is an S-curve profile, which also limits jerk by smoothing transitions between acceleration phases.
That distinction matters more than it might seem.
Limiting jerk is not simply about creating smoother motion. It reduces structural vibration, improves placement accuracy, and limits the inertial loads transferred to whatever the robot is carrying.
For delicate products like fruit, those effects become immediately visible.
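The trapezoidal case is simple enough to sketch for a single axis moving from rest to rest. The function names and limits are illustrative assumptions. Note that acceleration jumps instantly at each corner of this profile, so jerk is unbounded there; an S-curve would replace those corners with jerk-limited ramps.

```python
import math

def trapezoid_profile(distance, v_max, a_max):
    """Return (t_accel, t_cruise, v_peak) for a rest-to-rest move."""
    t_accel = v_max / a_max
    d_accel = 0.5 * a_max * t_accel ** 2
    if 2.0 * d_accel >= distance:
        # Too short to reach v_max: the profile degenerates to a triangle.
        t_accel = math.sqrt(distance / a_max)
        return t_accel, 0.0, a_max * t_accel
    t_cruise = (distance - 2.0 * d_accel) / v_max
    return t_accel, t_cruise, v_max

def position_at(t, distance, v_max, a_max):
    """Position along the move at time t, clamped to the move duration."""
    t_a, t_c, v_p = trapezoid_profile(distance, v_max, a_max)
    total = 2.0 * t_a + t_c
    t = min(max(t, 0.0), total)
    if t < t_a:
        return 0.5 * a_max * t * t
    if t < t_a + t_c:
        return 0.5 * a_max * t_a ** 2 + v_p * (t - t_a)
    t_rem = total - t
    return distance - 0.5 * a_max * t_rem * t_rem

print(trapezoid_profile(1.0, v_max=1.0, a_max=2.0))  # (0.5, 0.5, 1.0)
print(position_at(0.75, 1.0, v_max=1.0, a_max=2.0))  # 0.5, the midpoint
```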
The resulting trajectory is then tracked by servo loops running at kilohertz rates, continuously receiving updated setpoints from the motion planner and correcting for any deviation from the desired path.
When motion becomes contact
Up to this point, the robot has been moving through free space.
The moment the gripper touches the apple, the problem changes completely.
Motion control alone is no longer enough. The system now needs to manage force.
Apples bruise under relatively modest pressure. A damaged apple may look perfectly fine when it leaves the sorting cell, only to develop visible browning days later after it reaches a store shelf.
That means successful picking requires more than accurate positioning. It requires controlled interaction with the product.
In practice, production systems address this through a combination of hardware compliance and software feedback.
The hardware side may include:
- Soft fingers that conform to the product surface
- Vacuum cups that lift without squeezing
- Compliant end-effectors designed to absorb small positioning errors
The software side focuses on limiting applied force.
Grip force may be bounded through motor current limits, gripper closure may be stopped at first contact as detected by a force-torque sensor, or the force may be adjusted dynamically based on measured interaction forces.
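The stop-at-first-contact strategy can be sketched against a simulated gripper. Everything here is an assumption made for illustration: the `SimulatedGripper` class, its linear stiffness model, and the 1 N contact threshold stand in for real hardware and real tuning.

```python
class SimulatedGripper:
    """Stand-in for hardware: force rises linearly once the fingers touch the fruit."""

    def __init__(self, object_width_mm, stiffness_n_per_mm=2.0):
        self.opening_mm = 100.0
        self.object_width_mm = object_width_mm
        self.stiffness = stiffness_n_per_mm

    def step_close(self, step_mm):
        self.opening_mm = max(0.0, self.opening_mm - step_mm)

    def read_force_n(self):
        squeeze = max(0.0, self.object_width_mm - self.opening_mm)
        return squeeze * self.stiffness

def close_until_contact(gripper, contact_n=1.0, step_mm=0.25, max_steps=1000):
    """Close in small steps; stop at the first reading at or above contact_n."""
    for _ in range(max_steps):
        force = gripper.read_force_n()
        if force >= contact_n:
            return True, force
        if gripper.opening_mm <= 0.0:
            break  # fully closed without ever feeling the fruit
        gripper.step_close(step_mm)
    return False, gripper.read_force_n()

apple = SimulatedGripper(object_width_mm=70.0)
print(close_until_contact(apple), apple.opening_mm)   # (True, 1.0) 69.5
print(close_until_contact(SimulatedGripper(object_width_mm=0.0)))  # (False, 0.0)
```

The step size bounds the overshoot: in this model one step adds at most 0.5 N beyond the threshold. The second call shows the empty-grasp case, where the fingers close fully and no contact is ever reported.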
The exact implementation varies, but the principle remains the same.
Once a robot begins interacting with the physical world, position feedback alone is no longer sufficient.
The control system needs visibility into what forces it is applying, not just confirmation of where it moved.
That is true whether you are handling apples, assembling electronics, loading medical devices, or manipulating automotive components. Whenever contact matters, successful control depends on measuring and responding to the interaction itself rather than assuming the world behaved exactly as commanded.
From the simulation to a production line
Simulations are valuable because they simplify reality.
They allow engineers to isolate a problem, understand system behavior, and validate ideas without the cost and complexity of physical hardware. But every simulation achieves that clarity by holding part of the world constant.
Building a production system means adding that complexity back in.
For a robotic sorting cell, there are four major realities that a simulation typically keeps under control: noisy perception, real-time deadlines, system failures, and long-term drift.
Reality #1: perception is never this clean
In the simulation, every apple is clearly visible, evenly illuminated, and neatly separated from its neighbors.
Production environments rarely offer those conditions.
Real fruit is mottled. Apples overlap. Neighboring objects occlude one another. Some fruit arrives stem-up, hiding the very surface features the classifier would prefer to see.
This is exactly why the perception pipeline includes segmentation, masking, confidence measures, and voting strategies that let many pixels, rather than a single reading, decide the result.
A vision system should be designed with the expectation that the input data will be imperfect, because it always is.
Reality #2: production systems have deadlines
A missed frame in a browser simulation might produce a brief visual stutter.
A missed deadline in a production cell is a different problem entirely.
Real systems operate under throughput requirements, safety constraints, and deterministic timing expectations. Collaborative robots must respond within bounded time limits when a person or unexpected object enters the workspace.
That is why production systems are typically built around deterministic execution environments such as an RTOS or a real-time Linux kernel. The goal is not just to be fast on average. The system has to meet its timing requirements every time.
A demonstration can get by on software that is usually fast enough. A production system needs evidence of a bounded worst case, and that distinction matters most when software is controlling physical hardware near people.
Reality #3: things fail
One reason simulations look reliable is that failures rarely exist unless someone explicitly models them.
The simulated robot never drops an apple because no failure mode was implemented that allows it to do so.
Production systems do not have that luxury. Every reasonably foreseeable fault requires both detection and a defined response.
Examples include:
- A dropped apple
- A missed grasp
- A camera returning black frames
- An encoder reporting values inconsistent with commanded motion
- A gripper reporting a successful grasp when nothing was actually picked
Handling those failures is only part of the problem. The fault-handling mechanisms themselves also need monitoring.
A watchdog may restart the controller if the control loop becomes unresponsive. A heartbeat mechanism may continuously verify that a safety monitor is still operating correctly. Diagnostic systems need their own validation paths to ensure they can be trusted when something goes wrong.
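The heartbeat idea reduces to a timestamp and a timeout. The sketch below is a software-only illustration with hypothetical names; real cells typically pair this with a hardware watchdog timer, which is outside the scope of this example. The injectable `clock` exists so the behavior can be demonstrated without sleeping.

```python
import time

class HeartbeatMonitor:
    """Flags a monitored task as stale if it has not checked in within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Called by the monitored task on every healthy cycle."""
        self.last_beat = self.clock()

    def is_alive(self):
        return (self.clock() - self.last_beat) <= self.timeout_s

# Drive the monitor with a fake clock so the example is deterministic.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=0.05, clock=lambda: now[0])
now[0] = 0.03
alive_early = monitor.is_alive()   # True: 30 ms since the last beat
now[0] = 0.10
alive_late = monitor.is_alive()    # False: 100 ms of silence
monitor.beat()
recovered = monitor.is_alive()     # True again after a fresh beat
print(alive_early, alive_late, recovered)
```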
In mature robotic systems, a surprisingly large percentage of the codebase is dedicated to fault detection, fault recovery, and system monitoring. The nominal workflow of see apple, pick apple, place apple is often the smallest part of the software.
Reality #4: systems drift
Nothing ages inside a simulation. Production equipment does.
LED output changes over time. Lenses accumulate dust. Grippers wear. Mechanical tolerances shift. Even the product itself may vary from season to season.
A sorting system that performed well during factory acceptance testing may gradually lose accuracy if those changes are not monitored and addressed. That is why production systems need mechanisms for detecting drift before it becomes a quality issue.
A well-designed cell tracks its own measurement statistics and monitors for distribution shifts that indicate the operating environment is changing. The goal is to identify those trends before they become visible as misclassifications, throughput losses, or product defects.
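One lightweight way to watch for that kind of shift is an exponentially weighted moving average compared against a baseline captured at commissioning. The monitored quantity (here, the brightness of a fixed reference patch), the tolerance, and the smoothing factor are all assumptions for illustration, and the dimming rate in the example is simulated data.

```python
class DriftMonitor:
    """EWMA of a scalar measurement compared with a commissioning baseline."""

    def __init__(self, baseline, tolerance, alpha=0.1):
        self.baseline = baseline
        self.tolerance = tolerance
        self.alpha = alpha
        self.ewma = baseline

    def update(self, sample):
        """Fold in one sample; return True once the smoothed value leaves tolerance."""
        self.ewma = self.alpha * sample + (1.0 - self.alpha) * self.ewma
        return abs(self.ewma - self.baseline) > self.tolerance

# A single dark frame does not trip the alarm.
single_outlier = DriftMonitor(baseline=0.80, tolerance=0.05).update(0.50)

# Simulated slow LED dimming: brightness falls by 0.001 per batch.
monitor = DriftMonitor(baseline=0.80, tolerance=0.05)
first_alarm = None
for batch in range(1, 201):
    sample = 0.80 - 0.001 * batch
    if monitor.update(sample) and first_alarm is None:
        first_alarm = batch
print(single_outlier, first_alarm)
```

The smoothing makes the monitor ignore isolated bad readings while still flagging a sustained trend, at the cost of reacting a few batches late.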
A system that was verified once and never re-evaluated has an unknown level of accuracy thereafter.
The difference between a demo and a product
The simulation demonstrates the core architecture of the system: perception, decision-making, motion planning, and control.
A production deployment uses the same architecture. What changes is the amount of engineering required around it.
Real-world systems must tolerate uncertainty, meet deadlines, detect failures, and adapt to changing conditions over time. Those requirements often consume far more effort than the nominal workflow itself.
None of that makes the simulation trivial. Building a realistic one takes serious engineering; deployment adds the work that a simulation is allowed to skip.
The underlying architecture may be the same, but production systems are ultimately judged not by how well they work on a good day, but by how consistently they perform every day.
