Neural Architecture Search (NAS)

Neural Architecture Search (NAS) automates the design of neural networks. The process involves defining a search space of possible architectural components, a search strategy to explore this space, and an evaluation strategy to score candidate models. The primary goal is to discover the optimal architecture considering metrics like accuracy, latency/FLOPs, model size, and hardware constraints.

A NAS algorithm can manipulate various architectural elements, including:

  - Layer types (e.g., standard convolution, depthwise convolution, pooling)
  - Kernel sizes and strides
  - Network depth (number of layers) and width (channels per layer)
  - Connections between layers (e.g., skip connections)
  - Activation functions

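As a concrete illustration, a tiny search space can be written down as a mapping from each architectural decision to its allowed options. The decisions and values below are purely illustrative, not taken from any particular NAS system.

# A toy search space: each key is an architectural decision, each value its allowed options.
search_space = {
    "num_layers": [2, 4, 6],
    "layer_type": ["conv3x3", "conv5x5", "depthwise_conv"],
    "num_channels": [16, 32, 64],
    "use_skip_connection": [True, False],
}

# Every combination of options is one candidate architecture, so the space grows
# multiplicatively with each decision.
num_candidates = 1
for options in search_space.values():
    num_candidates *= len(options)
print(num_candidates)  # 3 * 3 * 3 * 2 = 54 candidates, even for this tiny space
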
Core Components of NAS

Search Strategies

Several strategies are employed to navigate the vast search space:

  - Random search: sample architectures uniformly from the search space; a surprisingly strong baseline (see the sketch below)
  - Reinforcement learning: a controller learns to propose architectures that earn high rewards (covered in detail later in this section)
  - Evolutionary algorithms: maintain a population of architectures and apply mutation and selection
  - Gradient-based (differentiable) search: relax the discrete choices into continuous weights and optimize them with gradient descent

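As a minimal sketch of the simplest strategy, random search just samples configurations and keeps the best one. The `build_and_evaluate` function here is a hypothetical placeholder for training a candidate briefly and returning its validation accuracy.

import random

search_space = {
    "kernel_size": [3, 5, 7],
    "num_layers": [2, 4, 6],
}

def random_search(num_trials, build_and_evaluate):
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        # Sample one option per decision, independently and uniformly.
        config = {name: random.choice(options) for name, options in search_space.items()}
        score = build_and_evaluate(config)  # e.g., a short training run + validation accuracy
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
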
Evaluation Strategies

Efficiently evaluating candidate architectures is crucial, since fully training every candidate is prohibitively expensive. Common approaches include:

  - Full training: train each candidate to convergence (most accurate, most expensive)
  - Lower-fidelity estimates: train for fewer epochs, on a data subset, or at reduced resolution, and use the result as a proxy
  - Weight sharing / one-shot models: train a single supernet containing all candidates and evaluate sub-networks with the shared weights (covered later in this section)
  - Performance predictors: fit a model that estimates accuracy directly from the architecture description


Notable Applications of NAS

NAS has been instrumental in designing state-of-the-art models, including:

  - NASNet, an early RL-based search over convolutional cells
  - MnasNet and MobileNetV3, mobile models searched with hardware latency in the objective
  - EfficientNet, whose baseline architecture (B0) was found with NAS and then scaled up (discussed below)
  - AmoebaNet, discovered with evolutionary search


NAS for EfficientNet

EfficientNet's development leveraged a reinforcement learning-based NAS. The search focused on:

  - The choice of building blocks (mobile inverted bottleneck convolutions with squeeze-and-excitation)
  - Kernel sizes within those blocks
  - The number of layers per stage and the number of channels (network depth and width)

The search target was a model optimized for both accuracy and latency on mobile hardware. Candidate models were trained on ImageNet, and latency was measured on actual devices rather than just relying on FLOP counts. An RL controller guided this search, leading to the EfficientNet-B0 architecture—the foundational model of the EfficientNet family.

After identifying B0, NAS was no longer used. Instead, a compound scaling method was introduced to systematically scale B0 into larger models (EfficientNet B1-B7). The compact search space and on-device latency measurement were key to EfficientNet's success, and the family remained competitive even against later Vision Transformer models.
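
The compound scaling rule itself is simple to state. The EfficientNet paper fixes coefficients α ≈ 1.2 (depth), β ≈ 1.1 (width), and γ ≈ 1.15 (resolution), chosen by a small grid search under the constraint α · β² · γ² ≈ 2, and scales all three together with a single exponent φ. The sketch below only illustrates the rule; the published B1-B7 configurations round and adjust these values, so the printed numbers are indicative rather than exact.

# Illustrative compound scaling following the rule from the EfficientNet paper:
# depth ~ alpha ** phi, width ~ beta ** phi, resolution ~ gamma ** phi,
# with alpha * beta**2 * gamma**2 ~= 2, so FLOPs roughly double per unit of phi.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi, base_resolution=224):
    depth_mult = alpha ** phi                                 # multiply the number of layers
    width_mult = beta ** phi                                  # multiply the channels per layer
    resolution = int(round(base_resolution * gamma ** phi))   # enlarge the input image
    return depth_mult, width_mult, resolution

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input resolution {r}px")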


Reinforcement Learning (RL) for NAS

In supervised learning, models learn from input/output pairs using a loss function. In Reinforcement Learning (RL), an agent interacts with an environment by taking actions. After each action, it receives a reward. The agent's objective is to learn a policy—a strategy for choosing actions—that maximizes the cumulative long-term reward.

REINFORCE Algorithm

A fundamental RL algorithm is REINFORCE. The core idea is to increase the probability of actions that lead to high rewards. The REINFORCE loss function is:

$$ \text{Loss} = -\log(\text{probability of the action}) \times \text{reward} $$

If the reward is high, minimizing this loss increases the log-probability of the action. This is a type of policy gradient method, where the controller's policy is directly adjusted to favor high-reward actions.
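
More formally, REINFORCE performs stochastic gradient ascent on the expected reward $J(\theta)$ of the controller's policy $\pi_\theta$; the gradient it estimates is:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a) \, R(a) \right] $$

The loss above is simply the negated single-sample estimate of the term inside this expectation, so minimizing it with gradient descent follows the policy gradient.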

Mapping RL Concepts to NAS

RL Concept     NAS Version
-----------    -----------------------------------------------------------------
Agent          Controller (e.g., an RNN)
Action         Selecting an architectural choice (e.g., kernel size, layer type)
Environment    Training and evaluating the candidate neural network
Reward         Performance metric (e.g., accuracy, latency-adjusted accuracy)
Policy         How the controller samples actions (architectural choices)

To enable the controller to learn over time, it's often implemented as an RNN. The RNN outputs logits for different architectural choices (actions). The REINFORCE algorithm (policy gradient) then updates the RNN's parameters based on the received reward.
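
A minimal sketch of such an RNN controller in PyTorch is shown below. The layer sizes, the use of an LSTM cell, and feeding an embedding of the previous action back in as the next input are illustrative choices, not the exact setup of any particular paper.

import torch
import torch.nn as nn

class RNNController(nn.Module):
    def __init__(self, num_choices, hidden_size=64):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(num_choices, hidden_size)  # embeds the previous action
        self.rnn = nn.LSTMCell(hidden_size, hidden_size)
        self.head = nn.Linear(hidden_size, num_choices)          # logits over the choices

    def sample(self, num_decisions):
        h = torch.zeros(1, self.hidden_size)
        c = torch.zeros(1, self.hidden_size)
        inp = torch.zeros(1, self.hidden_size)                   # "start" input
        actions, log_probs = [], []
        for _ in range(num_decisions):
            h, c = self.rnn(inp, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            action = dist.sample()
            actions.append(action.item())
            log_probs.append(dist.log_prob(action))
            inp = self.embedding(action)                         # feed the chosen action back in
        return actions, torch.stack(log_probs).sum()

A REINFORCE update then uses the summed log-probability of the sampled decisions, e.g. loss = -total_log_prob * reward, just as in the single-decision example that follows.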

Example: Using RL to Discover the Best Kernel Size

Let's consider finding the optimal kernel size for a convolutional block using RL:

  1. Search Space / Action Space: Define possible kernel sizes, e.g., [3, 5, 7].
  2. Controller: Implement a controller (e.g., an RNN or even a random sampler initially) that selects a kernel size.
  3. Evaluation Loop: For each sampled kernel size, build a small CNN with that kernel, train it briefly, and measure its validation accuracy; this accuracy (optionally adjusted for latency or model size) is the reward.
  4. Controller Update: Apply the REINFORCE loss so that kernel sizes which earned high rewards become more likely to be sampled next time.

Suppose the controller outputs a softmax distribution over kernel sizes and samples kernel = 3 with probability 0.62, and the resulting reward is 0.85. The REINFORCE loss would then be:

$$ \text{loss} = -\log(0.62) \times 0.85 \approx -(-0.478) \times 0.85 \approx 0.406 $$

Backpropagation through this loss adjusts the controller's logits to increase the probability of selecting kernel = 3 if the reward is high. Conversely, if the reward is low, the probability of choosing that kernel decreases. Often, a baseline reward (e.g., the moving average of past rewards) is subtracted from the current reward. This encourages actions that perform better than average.

The following Python code demonstrates a simplified version of this process. A controller samples kernel sizes, and it's trained using REINFORCE to maximize a simulated reward. We run 50 iterations of this sampling and learning process. The code will also generate a plot showing the controller's probability of choosing each kernel size over the training steps.


        
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

""" 
    Toy architecture choices: kernel sizes 
    This is the policy that chooses which architecture to build. It's a probability distribution over actions. 
    In RL the controller is the agent. 
    The controller trying to learn a probability distribution over architectural choices. 
    This is action taken by the agent, sampling kernel sizes. You sample from the controller's distribution. 
    It starts with choosing each kernel equally at first. These logits will get transformed to probabilities. Then you sample from this categorical distribution 
    Then you measure the reward from taking this action. This is the environment returning a reward based on the action the agent took. 
    The REINFORCE algorithm updates these logits to prefer higher-reward options, it's updating the contoller. 
    And you repeat with the newly updated controller. Sampling, training, evaluating and improving the controller. 
"""
kernel_sizes = [3, 5, 7]
logits = torch.nn.Parameter(torch.zeros(len(kernel_sizes))) # internal parameters of the controller
optimizer = torch.optim.Adam([logits], lr=0.1)

# Simulated reward function (pretend model performance)
# Let's say kernel=5 is the best
"""
In practice, the reward might come from actually training and evaluating a model:

def get_reward(kernel):
    model = SimpleCNN(kernel_size=kernel)
    train_model(model, train_loader)
    accuracy = evaluate_model(model, val_loader)
    return accuracy

Or it might be multi-objective:

    reward = accuracy - 0.01 * latency - 0.005 * model_size
"""
def get_reward(kernel):
    true_perf = {3: 0.7, 5: 0.9, 7: 0.6}  # pretend validation accuracies for each kernel size
    noise = torch.randn(1).item() * 0.02  # small noise to simulate evaluation variance
    return true_perf[kernel] + noise

# Tracking
history = {
    "step": [],
    "chosen_kernel": [],
    "reward": [],
    "probs": []
}

for step in range(50):
    probs = F.softmax(logits, dim=0)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    kernel = kernel_sizes[action.item()]
    reward = get_reward(kernel)
    loss = -dist.log_prob(action) * reward # REINFORCE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Log
    history["step"].append(step)
    history["chosen_kernel"].append(kernel)
    history["reward"].append(reward)
    history["probs"].append(probs.detach().numpy())

# Convert logged data for visualization
probs_over_time = list(zip(*history["probs"]))

# Plot probabilities over time
plt.figure(figsize=(10, 6))
for i, k in enumerate(kernel_sizes):
    plt.plot(history["step"], probs_over_time[i], label=f"kernel={k}")
plt.title("Controller's Probability of Choosing Each Kernel Size")
plt.xlabel("Training Step")
plt.ylabel("Probability")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
    
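The loop above uses the raw reward. Subtracting a moving-average baseline, as mentioned earlier, is a small change; here is a minimal self-contained variant (the 0.9 smoothing factor and the simulated per-kernel accuracies are illustrative choices):

import torch

kernel_sizes = [3, 5, 7]
logits = torch.nn.Parameter(torch.zeros(len(kernel_sizes)))
optimizer = torch.optim.Adam([logits], lr=0.1)
baseline = 0.0  # moving average of past rewards

for step in range(50):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = {3: 0.7, 5: 0.9, 7: 0.6}[kernel_sizes[action.item()]]  # simulated accuracy
    baseline = 0.9 * baseline + 0.1 * reward     # update the running average of rewards
    advantage = reward - baseline                # how much better than average this action was
    loss = -dist.log_prob(action) * advantage    # REINFORCE with a baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()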

Weight Sharing in NAS: The Supernet Approach

As you can imagine, Neural Architecture Search (NAS) can be computationally expensive, primarily because training an entire model for each sampled hyperparameter or architectural configuration takes a significant amount of time. That’s where weight sharing comes into play, offering a more efficient paradigm.

In this approach, you typically build a supernet (also known as a one-shot model) that encompasses many or all of the architectural choices from the search space. For instance, if you're searching for the optimal kernel size, the supernet would be designed to support multiple kernel sizes within a single, larger model structure. The weights for common parts of the network, or even for the different choices themselves, can be shared.

An example of a convolutional block within such a supernet might look like this in PyTorch:


import torch.nn as nn

class SuperConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Using nn.ModuleDict to hold different convolution operations
        self.ops = nn.ModuleDict({
            'k3': nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False), # bias=False is common in blocks followed by BN
            'k5': nn.Conv2d(in_channels, out_channels, kernel_size=5, padding=2, bias=False),
            'k7': nn.Conv2d(in_channels, out_channels, kernel_size=7, padding=3, bias=False),
        })
    
    def forward(self, x, choice):
        # The 'choice' argument determines which path (kernel size) to activate
        return self.ops[choice](x)

# Example of a simplified SuperNet structure
class SuperNet(nn.Module):
    def __init__(self, num_classes=10): # Example: 10 classes for classification
        super().__init__()
        # Assume input channels is 3 (e.g., for RGB images)
        # Output channels for the super conv block could be, e.g., 16
        self.conv_block = SuperConvBlock(in_channels=3, out_channels=16) 
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1) # Global average pooling
        self.fc = nn.Linear(16, num_classes) # Classifier

    def forward(self, x, arch_choice):
        x = self.conv_block(x, arch_choice)
        x = self.relu(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1) # Flatten
        x = self.fc(x)
        return x
    

The supernet is designed to support all three kernel sizes (k3, k5, k7 in this example), and the key idea is that the "weights" for these operations (or parts of them) can be "shared" or jointly trained. During the search or training phase, in each forward pass, only one path (e.g., one kernel size) is typically activated and trained, often based on a decision from a controller or a sampling strategy.

Each architectural choice (e.g., kernel size) gets trained and selected in turn, but critically, the updates are applied to the shared weights within the supernet. This means you're not training many individual models from scratch but rather training different sub-networks within the larger supernet.

A simplified training loop for the supernet might look like this:


import random
import torch.nn.functional as F

# Assumes `supernet`, `optimizer`, `train_loader`, and the number of epochs `E`
# have been defined elsewhere (e.g., supernet = SuperNet() with an optimizer over its parameters).
for epoch in range(E):
    for batch_idx, (x, y) in enumerate(train_loader):
        # Randomly select an architectural choice for this batch/step
        # In a more advanced setup, a controller or search strategy makes this choice
        choice = random.choice(['k3', 'k5', 'k7'])
        
        pred = supernet(x, arch_choice=choice)
        loss = F.cross_entropy(pred, y)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    

During this training process, only a part of the supernet (the selected path) is active and updated in each step. A controller might sample the desired architecture, or paths might be sampled uniformly or according to a schedule. The performance of these sampled sub-architectures is evaluated, often using these shared weights.

Once the supernet training is complete, a separate search phase might occur where different sub-architectures are evaluated using the trained supernet weights. The final, top-performing subnet (the "discovered" architecture) is then typically retrained from scratch to achieve its optimal performance, as the shared training process can sometimes lead to slight miscalibrations for any single architecture.
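
A minimal sketch of that search phase, reusing the `supernet` and architecture choices from the examples above (the `val_loader` and the helper name `evaluate_choice` are assumptions for illustration):

import torch

def evaluate_choice(supernet, choice, val_loader):
    # Score one sub-architecture using the shared supernet weights, without any further training.
    supernet.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            pred = supernet(x, arch_choice=choice)
            correct += (pred.argmax(dim=1) == y).sum().item()
            total += y.size(0)
    return correct / total

scores = {c: evaluate_choice(supernet, c, val_loader) for c in ['k3', 'k5', 'k7']}
best_choice = max(scores, key=scores.get)  # this sub-architecture is then retrained from scratch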

This weight-sharing approach might seem counterintuitive because individual sub-architectures might not fully converge optimally during the supernet training. However, this "non-convergence" (or rather, coupled training) is by design and is precisely what makes this approach significantly faster and cheaper than training every candidate model independently. The goal of supernet training is often to quickly estimate the relative potential of different architectural choices. You're not aiming to train one model to perfect convergence immediately, but rather to train a population of potential sub-models simultaneously, allowing each to receive partial updates and guiding the search towards promising regions of the architecture space. The weights of each operator within the supernet are gradually improved over many such updates.