Fine-tuning large language models for domain adaptation: exploration of training strategies, scaling, model merging and synergistic capabilities

We follow the process depicted in Fig. 2 in developing models and conducting assessments. Figure 2A shows a conventional linear training pipeline where a base model undergoes Continued Pre-Training (CPT), followed by Supervised Fine-Tuning (SFT), and then optimized using methods like Direct Preference Optimization (DPO) or Odds Ratio Preference Optimization (ORPO) to produce a trained model. Figure 2B shows an alternative pipeline where, after CPT, SFT, and optimization (e.g., DPO, ORPO), the model is further enhanced by merging it with another fine-tuned model (e.g., a general-purpose model). We note that model merging can be done with models extracted from various training stages, such as after CPT, SFT or at the final stage. The implementation process is structured sequentially, with each step building upon the previous one to enhance the model’s capabilities in stages. As summarized in Table 1, CPT exposes the model to domain-relevant data, while SFT delivers task-specific instruction using labeled datasets in formats such as question-answer or instruction-response. Preference optimization methods (e.g., DPO or ORPO) align the model with human or baseline preferences established in prior steps. Finally, model merging integrates the strengths of different training paths, leading to improved generalization and, in many cases, the emergence of new capabilities absent from individual models. This structured progression ensures a balance between task-specific accuracy and user preference alignment by addressing knowledge acquisition and preference tuning at distinct stages.

**Fig. 2: Model training, merging and assessment stages.**

For the purpose of the analysis, we go into the details of model merging strategies. In this work, we focus on Spherical Linear Interpolation (SLERP, details see “Materials and methods” section), as we found it to be the most effective method. SLERP is a mathematical technique originally introduced in the field of computer graphics for smoothly interpolating between rotations represented by quaternions³⁰. SLERP has found widespread application in various domains that require smooth transitions between orientations or states, including robotics, physics simulations, as well as real-time graphics. For instance, in robotics, SLERP is used for the practical parameterization of rotations, allowing for seamless motion planning and control³¹. In physics simulations and computer graphics, SLERP is crucial for visualizing and animating rotations in a way that preserves the continuity and smoothness of motion^32,33. By maintaining the geometric relationships between interpolated states, SLERP ensures that transitions are both smooth and physically meaningful, making it a useful tool in scenarios where precise and continuous interpolation is required. Figure 3 shows the basic concepts of SLERP (versus linear interpolation, LERP), visually. A key aspect of this strategy is that the smooth, nonlinear path helps to preserve the underlying structure of the model parameters. The sphere in this context represents the inherent structure of the model’s parameter space, and by maintaining the geometric relationship between the parameters, SLERP ensures that the interpolation respects this original structure and does not puncture it (as linear combination of points would), leading to a more meaningful and coherent blending of capabilities rather than random, unstructured changes. Because the merged points are both congruent with the model geometry (that is, they lie on the sphere used here for demonstration) and because they realize new points previously not accessed, emergent features and capabilities could potentially be unlocked. The smoothness and spherical symmetry assumed in SLERP help preserve angular relationships between parameters, avoiding high-loss regions typically encountered in linear interpolation. This capability is especially beneficial for materials science applications, where parameter space discontinuities and asymmetries are common. SLERP enables better generalization and the development of emergent capabilities, making it a powerful tool in this domain.

Fig. 3: Comparison of SLERP (Spherical Linear Interpolation) and LERP (Linear Interpolation) between two points on a unit sphere, illustrating their application in merging Large Language Model (LLM) parameters.

In the following, we present a series of results from assessment experiments conducted with different model families and training/merging strategies (details on training, models, datasets, and assessment benchmarks, see “Materials and methods” section). Figure 4 depicts a series of performance evaluations of Llama-3.1 Model variants across benchmarks. We use two basic models as the foundation for our training. First, meta-llama/Meta-Llama-3.1-8B, the base model of the Llama family that has not been fine-tuned and aligned. Second, the meta-llama/Meta-Llama-3.1-8B-Instruct model that has been fine-tuned and aligned to conduct question-answer interactions, along with a host of other capabilities⁶. Except for the LoRA case¹⁴, all of our experiments include CPT (see Table 1 for an overview of the training stages and acronyms used) as the first step, with the aim to endow the base model with domain knowledge from our materials science corpus of papers and distilled, extracted and processed data sourced from scientific studies. We then implement a range of variations, such as CPT only, CPT-SFT, CPT-SFT-ORPO and CPT-SFT-DPO. At each stage, we also implement model merging with the meta-llama/Meta-Llama-3.1-8B-Instruct model. Overall, the results reveal that the models that have undergone SLERP merging (especially those combined with DPO and ORPO strategies) generally show the highest accuracy across benchmarks. The best strategy without model merging is found to be the Instruct-CPT-SFT-DPO strategy.

**Fig. 4: Performance evaluation P of Llama-3.1 model variants across benchmarks.**

We now conduct the same series of experiments using Mistral-v0.3 model variants⁴ across benchmarks. As in the previous set of results, we use the same dataset across all cases, and we present both non-merged cases and merges with the mistralai/Mistral-7B-Instruct-v0.3 model. Figure 5 depicts an overview of the performance evaluations across benchmarks for this case. As before, the results show that these models that have also undergone SLERP merging generally show the highest accuracy across benchmarks. The best strategy without model merging is found to be the Base-CPT-SFT strategy, albeit the performance of the Instruct-CPT-SFT strategy is very similar.

**Fig. 5: Performance evaluation P of Mistral-7B-v0.3 model variants.**

The CPT stage involves five epochs. To explore the effect of the number of epochs in this phase, we computed the performance of the direct CPT-SLERP merges for the Mistral models from different training epochs. It is noted that the original merges assessed (and SFT, DPO/ORPO training stages) in Fig. 5 were conducted based on CPT results from epoch 5. Figure 6 depicts a comparison of averaged scores across different epochs for both the Base and Instruct models, using the SLERP method. Figure 6A shows an overview of the results, in a similar format as the earlier performance assessments, depicting performance across all models and variants of CPT epochs used. Figure 6B shows the performance of the Instruct model as a function of the number of CPT training epochs, and Fig. 6C illustrates the performance of the Base model. We can see that the Instruct model demonstrates a consistent improvement in performance with each epoch, peaking at the best score by epoch 5, indicating that it benefits significantly from continued training. In contrast, the Base model shows a more fluctuating performance, with its highest score at epoch 1, followed by slight declines and only a minor recovery at epoch 5. This suggests that while the Base model starts strong, it does not consistently improve with additional training, potentially indicating a saturation point. Both models, however, consistently outperform the baseline score set by the original mistralai/Mistral-7B-Instruct-v0.3 model, underscoring the effectiveness of the SLERP method, and consistent with the earlier results. The more substantial improvement of the Instruct model over the baseline highlights its robustness in instruction-tuned tasks, making it the preferable choice for such applications, particularly when extended fine-tuning is feasible.

**Fig. 6: Comparison of averaged scores across different epochs for both the Base and Instruct models fine-tuned with the Spherical Linear Interpolation (SLERP) method.**

Table of Contents

Detailed analysis of key factors in model merging

As the results in Figs. 4 and 5 clearly reveal, SLERP appears to significantly improve model performance due to its ability to respect the geometric properties of the parameter space. However, this analysis did not yet reveal whether we have a significant synergistic effect. To examine this, we plot the results differently, comparing the actual measured performance with an expected performance that is computed by simply averaging the scores of the two parent models. To properly define all key variables, the performance of a merged model is defined as ${P}_{\text{merged};{P}_{1},{P}_{2}}$ (measured per the benchmark), while the expected, averaged score E(P₁, P₂) is calculated as the linear average of the performances of the two parent models:

$$E({P}_{1},{P}_{2})=\frac{{P}_{1}+{P}_{2}}{2}.$$

Using these metrics, Fig. 7 shows a detailed exploration of performance of SLERP variants for different cases, plotting the actual observed performance over an estimated, expected score based on a simple average of the score of both parent models (linear combination).

Fig. 7: Exploration of performance of Spherical Linear Interpolation (SLERP) variants for different cases, plotting the actual observed performance of the merged model ${P}_{\text{merged};{P}_{1},{P}_{2}}$ over a linear, expected average score E(P₁, P₂) based on a simple average of the score of both parent models (linear combination).

Notably, the strong deviation from the diagonal reveals nonlinear, synergistic effects, where the actual observed model performance is much greater than a simple averaging of the capabilities of the parent models alone. Results are shown for both the Llama-3.1-8B and Mistral-7B-v0.3 model series, respectively, for a variety of training strategies and datasets used in the process. We find that the results are similar for both models. An important distinction that can be seen in the analysis is that for the Llama models, the best-performing model (lamm-mit/Llama3.1-8b-Instruct-CPT-ORPO-SLERP) is based off the Llama Instruct model, whereas for the Mistral model (lamm-mit/mistral-7B-Base-v0.3-CPT-SFT-SLERP) it is based off the Mistral Base model.

To better understand the mechanics behind the observed effects, we briefly discuss the mathematical underpinnings of SLERP merging. Unlike linear interpolation, which assumes a flat Euclidean space, SLERP explores a richer parameter space by interpolating along a curved path on a unit sphere (we refer also to Fig. 3). This approach allows SLERP to uncover regions in the parameter space that might represent combinations of parameters more effective than those found in either model alone. SLERP further balances the specialized knowledge learned by each model, combining their strengths without simply averaging them. By avoiding high-loss regions that linear interpolation might pass through, SLERP ensures a smoother transition, potentially leading to better generalization in the merged model. The nonlinear nature of SLERP’s path also considers the complex interactions between parameters, which can reveal beneficial interactions that a simple linear combination would miss. Furthermore, SLERP may act as a form of regularization, preventing overfitting to the idiosyncrasies of a single model’s training data, thus enhancing generalization. Finally, SLERP helps mitigate the effects of catastrophic forgetting, preserving knowledge from both models when one has been fine-tuned or trained after the other. These factors combine to make SLERP a powerful tool for model merging, leading to a merged model that often performs better than either of the original models on their own.

Hence, we believe that the observed effectiveness of SLERP in merging models can be attributed to its ability to enhance nonlinear interactions between parameters by exploring the spherical geometry of the parameter space. Given two sets of model parameters θ₁ and θ₂, each parameter can be seen as a vector in a high-dimensional space. The interpolation performed by SLERP respects the curvature of this space, allowing for combinations of parameters that are not simply linear but involve deeper, nonlinear synergies (see Fig. 3). Consider the parameters θ₁ and θ₂ as consisting of individual components θ_1,i and θ_2,i in a given layer of the neural network. SLERP combines these parameters as follows:

$${\theta }_{i,\text{merged}}=\parallel {\theta }_{1}{\parallel }^{1-t}\parallel {\theta }_{2}{\parallel }^{t}\left(\frac{\sin ((1-t)\omega )}{\sin (\omega )}{\hat{\theta }}_{1,i}+\frac{\sin (t\omega )}{\sin (\omega )}{\hat{\theta }}_{2,i}\right)$$

This combination allows for interactions between θ_1,i and θ_2,i that are nonlinear in nature. For example, if θ_1,i and θ_2,i represent weights connected to different features in the network, their spherical combination could activate a new feature ϕ_i that is not present in either model individually:

$${\phi }_{i}=f\left({\theta }_{i,\text{merged}}\cdot {x}_{i}\right)$$

where x_i is the input feature and f(⋅) is the activation function. The nonlinear combination of parameters may lead to new behaviors or capabilities, as the interpolated parameters could synergistically enhance or suppress features in ways that the individual models cannot.

SLERP avoids destructive interference by maintaining the angular relationships between the parameter vectors, which can prevent the loss of specialized features learned by either model. The spherical symmetry imposed by SLERP introduces a regularization effect, smoothing the transition between the models and enabling the merged model to generalize better. This process often results in the emergence of new capabilities or improvements in performance that neither of the original models possessed.

The ability of SLERP to uncover these new capabilities can also be understood through the lens of overparameterization and the principles of ensemble methods. Overparameterized neural networks are known to generalize well, even when trained to zero error, due to their increased capacity to capture complex patterns³⁴. SLERP leverages this capacity by combining parameters in a nonlinear fashion, effectively utilizing the high-dimensional space in which these parameters reside. As a result, the merged model can exhibit emergent properties that are not apparent in either of the original models. SLERP’s mechanism resembles ensemble methods, where combining diverse models leads to better generalization³⁵. In this case, the diversity comes from the different training histories and learned features of the two models. The spherical interpolation pathway created by SLERP acts as a continuum of model ensembles, where at each point along the path, the combined parameters may activate new and beneficial feature interactions. SLERP not only preserves the strengths of the individual models but also has the potential to generate entirely new capabilities through its sophisticated interpolation method. This makes it a useful tool for our goal to merge models that complement each other or to create a more versatile and generalizable model from existing pre-trained models.

We examine the variations and potential trends in the strategies explored using clustering analysis. Figure 8 provides a comprehensive clustering analysis of SLERP strategies applied to both the Llama-3.1-8b and Mistral-7B-v0.3 models, and the resulting impact on their performance. We explore the use of two methods. First, K-Means clustering, a partition-based method that groups data into a predefined number of clusters by minimizing the distance between data points and the cluster centroids, providing insight into the natural groupings of models based on their expected and actual performance. Second, we use hierarchical clustering, an agglomerative method that creates a tree-like structure, a dendrogram, to show the nested relationships between models at various levels of similarity, revealing the hierarchical organization and potential sub-groupings within the data.

**Fig. 8: Clustering analysis of Spherical Linear Interpolation (SLERP) strategies and hierarchical clustering dendrograms for model performance.**

Figure 8 A illustrates the K-Means clustering of the Llama-3.1-8b models using standardized expected and actual scores, with Gaussian KDE (Kernel Density Estimation) applied to visualize the centroids. The analysis reveals distinct groupings that correspond to different SLERP strategies, indicating that specific merging techniques produce closely related performance outcomes. Figure 8B presents a similar K-Means clustering for the Mistral-7B-v0.3 models. Here too, distinct clusters emerge, showing that the SLERP strategies significantly influence the models’ performance profiles. Notably, the clustering patterns observed in the Mistral models are more pronounced compared to the Llama models, suggesting that the Mistral architecture might be more sensitive to these optimization and merging strategies.

Across both the Llama and Mistral models, the K-Means analysis clearly delineates two performance-based clusters. Models that incorporate multiple fine-tuning strategies, especially ORPO, consistently form clusters with higher actual scores, outperforming models that rely on simpler strategies. This suggests that the complexity and thoroughness of the fine-tuning process play a crucial role in achieving better model performance, as indicated by the clustering results. Next we explore hierarchical clustering as a way to better break down these distinctions.

To do this, we use a dendrogram analysis. A dendrogram is a tree-like diagram that displays the arrangement of clusters generated through hierarchical clustering. This visualization helps elucidate the relationships among the models, with closely related models (in terms of performance) clustering together. The dendrogram reveals that models employing similar training strategies are grouped into distinct subclusters, highlighting the effectiveness of these approaches in shaping model performance. Figure 8C introduces the hierarchical clustering dendrogram for the Llama-3.1-8b models, and Fig. 8D for the Mistral-7b models. The dendrogram demonstrates how different models cluster together, indicating similar performance outcomes. When comparing the dendrograms of the Llama and Mistral models, it becomes evident that while both models are positively influenced by SLERP strategies, the Mistral models show more defined clustering patterns. This suggests a stronger impact of the various strategies on the Mistral architecture.

Figure 9 shows the effect of using a larger CPT dataset using the extended dataset of 8000 papers, but with more varied format including more defected text, when training the Llama series. As can be seen, performance decreases, underscoring the effect of higher quality, clean data for positive training outcomes. As mentioned earlier, the extended dataset was constructed using a mix of PDF2Text and Nougat OCR³⁶; we found these methods to yield more variable text quality. While, for instance, Nougat can successfully render equations in Markup format, it also leads to a relatively frequent occurrence of unknown symbols, page breaks, repeated characters, and other defects. These methods did not cause issues when the data was further processed into question-answer pairs or summaries, rational explanations, and so on, but there is an apparent negative effect on CPT as the data is provided in raw format.

Fig. 9: Effect of using a larger Continued Pre-Training (CPT) dataset using the extended dataset, along with the lamm-mit/magpie-ultra-v0.1 dataset, including a more varied quality with a higher content including more defected text.

Likewise, a similar test case with the extended dataset was conducted with the Mistral series of models. The variant trained on the original integrated dataset achieved the best overall benchmark of 0.81, whereas the variant trained on the extended dataset achieved 0.80. These results suggest that future experiments could be conducted to assess the effect of this particular dataset variation on that model architecture’s performance. We leave this to future work; noting that the overall effects of model merging and the use of Base vs. Instruct models as basis are stable, with differences, however, in which exact strategy yields the best results: For the Llama and Mistral models, it was Instruct-CPT-SFT-ORPO-SLERP and Instruct-CPT-ORPO-SLERP. These observations are further complicated by the effect of prompting, which can skew results one way or the other. An overarching theme, however, is that consistently, SLERP merging yields super performance. For a straightforward and computationally effective way to implement a fine-tuning strategy, the procedure Instruct-CPT-SLERP is probably the best overall choice. While it does not yield the best performance for all scenarios, it generally yields strong performance. The differences show that nuanced benchmarking and prompt engineering can be critical.

Mechanistic analysis to elucidate key steps with highest impact on performance

As a next step in the analysis, we focus on correlation heatmaps to illustrate the relationships between various model attributes and the performance of merged models. As shown in Fig. 10, the performance of a merged model is denoted as P_merged, while the performance of the two parent models is denoted as P₁ and P₂. Performance improvement is defined as the difference between the performance of the merged model and the maximum performance of the two parent models:

$${\rm{Performance}}\, {\rm{Improvement}}={P}_{{\rm{merged}}}-\max ({P}_{1},{P}_{2})$$

Diversity between parent models is measured as the absolute difference between their individual performances:

$${\text{Diversity}}\,=| {P}_{1}-{P}_{2}|$$

Fig. 10: Visualization of correlation heatmaps to assess the relationships between various model attributes and the performance of merged models, where the performance of the merged model is the primary outcome of interest.

To elucidate overall trends that can be gleaned from the results, Fig. 10 depicts correlation heatmaps for the fine-tuned Llama and Mistral models. The data reveals distinct relationships between various metrics. In Llama models, a strong negative correlation between diversity and SFT suggests that higher diversity reduces reliance on supervised fine-tuning, whereas performance improvement shows moderate positive correlations with both merged performance and SFT, indicating that these factors contribute to improved outcomes. In contrast, Mistral models exhibit a more robust positive correlation between performance improvement and merged performance, especially in instruction-tuned models, where the Base model type significantly enhances merged performance. ORPO, while contributing to performance improvements in both models, has a more pronounced impact in Mistral models. Overall, the findings suggest that diversity tends to reduce SFT dependency, particularly in Llama models, while instruction-tuned Base models in Mistral benefit more from merging strategies, emphasizing the importance of model selection and optimization methods.

In model merging, there are several parameter choices, including the relative density of the parameters that are preserved across the layers of the LLMs being merged. This is exemplified in Fig. 3 for the two points merged at 30% vs. 70% along their SLERP paths. In Fig. 11 we conduct a systematic analysis on variants of the original SLERP merge used in the earlier examples, with a range of alternative options, for the best-performing strategy in the case of the Llama-3.1 Model variants (CPT-SFT-ORPO-SLERP). As depicted visually, we vary the self-attention filter values distinctly from the multilayer perceptron (MLP) values (Fig. 11A). Different weighting schemes are employed, starting from the reference case that was chosen based on earlier work²⁹. Resulting performance measures are summarized in Fig. 11B, showing that Variants G and F show the best performance (where G is a simple linear progression across the depth of the LLM) (detailed performance assessments for other variants are shown in Fig. 11).

Fig. 11: Model merging incorporates varying the relative density of how parameters are combined across the layers of the large language models (LLMs) involved (see in Fig. 3 for the two points merged at 30% vs. 70%).

Contrasting assessments with very small LLMs

While the models studied earlier were modest in size, around 7–8 billion parameters, recent research has resulted in even smaller, yet useful, models that can be particularly useful for edge computing applications, or deployment on devices such as mobile phones or robotic systems. We now examine whether such models also show the marked effects observed earlier due to model merging. We conduct this analysis using the SmolLM model series, specifically the 1.7 billion parameter model. This choice is partially motivated by the complete open access of the model, training strategy, and training data. As in the earlier analyses, we start with the base model and successively apply CPT, SFT and DPO (we found that for this small model, DPO worked better than ORPO). Though never reaching the level of absolute performance of fine-tuned 7B or 8B models, in almost all fine-tuning cases with SmolLM, we find the most significant performance increases relative to its original model with the CPT-SFT-DPO version of SmolLM being the top performing variant.

As depicted in Fig. 12 while we observe a significant emergence of new capabilities when applying SLERP to large-scale language models in the 7B and 8B parameter ranges, these emergent behaviors were absent in smaller models, such as those with 1.7B parameters. This may suggest a threshold effect where SLERP’s potential to unlock novel abilities is contingent on model size. Smaller models might lack the same level of complexity as larger 7B–8B models that have notably richer high-dimensional parameter spaces and capabilities, especially for reasoning and knowledge recall. These findings underscore the importance of model scale in the manifestation of emergent properties and provide critical insights into the interplay between interpolation techniques and model complexity. Our results contribute to the broader understanding of scaling laws in neural networks, highlighting the conditions under which advanced capabilities may be realized. A summary of the observed performance over the expected, averaged performance of the base model is shown in Fig. 7C.

**Fig. 12: Comparative performance analysis of SmolLM-1.7B models across benchmarks.**

Further quantification of the effects of model merging across all model architectures

To better understand whether or not and to what degree model merging improves performance over either one of the two models used for merging, we present the analysis shown in Fig. 13. The plot shows performance deviation of SLERP merged models compared to the best original model used as source.

**Fig. 13: Performance deviation of Spherical Linear Interpolation (SLERP) merged models compared to the best original model.**

The results reveal that the deviation in performance between models merged using SLERP and their best-performing original counterparts, whereby the deviation is calculated as the difference between the best original model’s performance and the SLERP model’s performance. Hence, negative deviations, where SLERP underperforms relative to the best original model, are marked in red. Positive deviations, indicating better performance of the SLERP model, are shown in shades of blue, with darker blue representing greater improvements.

Interactive examples for open-ended cross-material reasoning and material design tasks

In our next experiment, we conduct interactive conversations with a set of the models, using consistent system prompts and identical user input. We aim to test multi-turn capabilities of the models, assess responsiveness to system prompts and instructions, and the capability to produce structured output (JSON). We will further assess the quality of synthesis of each model, along a set of criteria that include depth of reasoning, creativity, clarity, and whether or not quantitative predictions are featured. Each of the conversations unfolds as follows:

Figures 14–18 present the results of conversations of a human user with a selection of five models (best-performing models, and DPO trained models).

**Fig. 14: Conversation generated by the model.**

**Fig. 15: Conversation generated by the model.**

**Fig. 16: Conversation generated by the model.**

**Fig. 17: Conversation generated by the model.**

**Fig. 18: Conversation generated by the model.**

All five models demonstrate a strong ability to connect seemingly unrelated concepts, such as collagen and leaves, and to propose innovative materials designs. The output features rich Markup formatting (note, the raw source is shown here). The models show a consistent understanding of the biological and materials science concepts involved, suggesting that their fine-tuning on domain-specific content has been effective. The differences in the responses mainly pertain to the depth of explanation, the creativity of the proposed designs, and the clarity of the output in both natural language and structured formats like JSON.

As shown in Fig. 14, the model lamm-mit/Llama3.1-8b-Instruct-CPT-SFT-DPO provides a well-organized and detailed discussion, drawing clear parallels between the structure and function of collagen and leaves. The proposed material design is robust, incorporating key components like a collagen-based matrix, vascular-like channels, and mesophyll-like cells. The response is notable for its comprehensive breakdown of each component’s role, leading to a thorough and scientifically grounded design. The JSON summary is precise, reflecting the structure of the proposed design effectively.

Figure 15 shows results for the lamm-mit/Llama3.1-8b-Instruct-CPT-SFT-ORPO-SLERP-Var_G model. This model delivers a more concise but also insightful analysis of collagen and leaves. The material design focuses on integrating collagen fibrils with cellulose microfibrils and chloroplast-inspired nanoparticles. This resembles an inventive approach to enhancing the material’s properties. This model excels in identifying the potential applications of the designed material, showcasing a broader vision for its use. The JSON representation is clear and well-structured, effectively summarizing the design features.

Figure 16 captures results of the lamm-mit/mistral-7B-v0.3-Base-CPT-SFT-DPO model. The responses are found to be comprehensive, with a strong focus on the mechanical properties of collagen and leaves. The proposed material design is detailed, incorporating collagen fibrils, cellulose nanofibers, silk fibroin, and a nanocellulose-based matrix. This model particularly stands out for its emphasis on the integration of these components to enhance the material’s toughness and durability. The JSON summary is thorough, capturing the complexity of the design and its potential applications.

Next, Fig. 17 shows results for the lamm-mit/mistral-7B-v0.3-Base-CPT-SFT-SLERP model. This result illustrates a more straightforward and less detailed analysis compared to the others. While the connection between collagen and leaves is adequately explained, the material design is simpler, focusing on collagen fibers, cellulose nanofibers, chlorophyll, and pectin. This response is notable for its clarity and simplicity, making it accessible but perhaps lacking the depth seen in other models. The JSON summary is basic but effective in conveying the key elements of the design.

Finally, Fig. 18 showcases the results of the smallest model in this study, lamm-mit/SmolLM-Base-1.7B-CPT-SFT-DPO. The model offers an inventive, creative, and highly detailed response, integrating a broad range of components such as collagen fibrils, nanocrystalline cellulose, and an alginate adhesive, resulting in a material creatively referred to as “Leafy-Coraline (LC) Composite”. This model excels in proposing a novel composite material with self-healing and shape-memory properties, reflecting a relatively high level of creativity and technical understanding. The JSON summary is comprehensive, capturing the innovative aspects of the design and its potential applications effectively. In spite of its size, this model provides excellent responses.

A summary of key observations is shown in Table 2. We can see that each model demonstrates strengths in different areas, from detailed explanations and innovative designs to clear and concise JSON summaries. The variations in depth, creativity, and technical detail among the models highlight the diversity of approaches and the potential for each to be suited to different types of tasks or applications. Overall, these models provide a strong foundation for further exploration and development in the field of bio-inspired materials design. A detailed breakdown of the analysis, generated using GPT-4o by analyzing the raw text of the conversations, is shown in Table 3, showing the results of an analysis conducted by sharing the raw text with GPT-4o and prompting the model to identify criteria and assess the different versions. We emphasize that GPT-4o is used as a supporting evaluator, not the definitive authority. LLM-based evaluation minimizes subjectivity and ensures reproducibility across tasks and models compared with human assessment. The potential of LLMs as evaluators has been explored in refs. ^37,38 and evaluated in refs. ^39,40,41, and GPT-4o is supported by studies beyond OpenAI’s GPT-report⁴².

Table 2 Summary of model performance, strengths, weaknesses, and design strategies based on conversations related to collagen and leaves

Table 3 Summary of model performance with individual criterion scores, total intelligence score, average score, and normalized intelligence score

Agentic use in image generation: applications in cross-domain knowledge integration for materials and urban design

We show several examples to highlight the capabilities and potential of fine-tuned LLMs, showcasing a particular application in materials design. The overall goal is to explore how the LLMs developed here can be used to reason over complex materials principles and use the insights developed through multi-step prompting to create a prompt for image generation. We note that our fine-tuned SmolLM based model performs extremely well for this task and yields quite creative prompts that integrate various ideas and concepts (underscoring its potential as a creative agent), and the visual representation of conceptual designs guided by the prompts are generated via a custom version of a FLUX.1 [dev] model⁴³ fine-tuned using a image dataset of leaf microstructures. The prompt generation demonstrates the LLM’s capability to conceptualize new materials-inspired design principles, which have the potential to be synthesized into physical structures. Specifically, the innovative designs are guided by the decomposed principles generated through prompt generation, which are directly relevant to materials science. Besides, LLM-generated prompts, new design possibilities can be explored that may not have been intuitive through traditional design processes. This approach bridges the gap between conceptual design and practical material development, and shows how AI can assist in design informatics by accelerating concept exploration and identifying viable design features.

While slight variations of prompting are used to yield different examples (each result presented here includes a detailed presentation of all features), the general goal is to think about design principles that we can extract from combining different biological materials. For instance, we ask the model to think about ways to combine design elements from spider silk and collagen to make a strong, lightweight but tough material, and to also incorporate design cues from leaf microstructures. The approach can be used to focus directly on material microstructures but can also be used to yield cross-domain results, such as architectural ideas or city design. We have provided several design examples that span a range of applications, from structural to urban design, to illustrate the versatility and potential of bio-inspired design principles. The examples showcase how models trained on biomimetic materials can conceptualize structural motifs commonly observed in hierarchical composites and biomaterial-inspired structures. Furthermore, by presenting visual representations of bio-inspired designs, we demonstrate the capability of the LLM to bridge conceptual design with potential real-world synthesis in the field of materials science.

In our first example, we prompt the lamm-mit/SmolLM-Base-1.7B-CPT-SFT-DPO model as follows (Fig. 19 shows the entire conversation):

**Fig. 19: Conversation generated by the model.**

A collection of images produced in this way is shown in Fig. 20. The resulting images resemble a visionary architectural concept where biomorphic structures seamlessly blend with nature, creating a futuristic and sustainable environment. The designs are inspired by natural forms such as honeycombs, coral, and cellular structures, characterized by fluid, curving lines and intricate lattice-like frameworks that evoke the organic world. These structures are integrated with greenery, emphasizing harmony with the environment and suggesting the use of innovative, eco-friendly materials. The open, flowing spaces with large archways and natural light emphasize a connection with the outdoors, creating a sense of tranquility and well-being. As can be seen in some of the images, human figures within these spaces highlight the livability and community-centric design, suggesting a vision where technology and nature coexist harmoniously. The images present a forward-looking approach to architecture, where sustainability, esthetics, and advanced techniques converge to create a new paradigm for living and public spaces. A close inspection of the resulting designs further suggests a clear emergence of the leaf microstructure patterns, a result of the prompt and the fine-tuned ability of the generative model to incorporate this particular design idea.

**Fig. 20: Image generation results developed by lamm-mit/SmolLM-Base-1.7B-CPT-SFT-DPO, leading to this prompt used for image generation (through FLUX.1 [dev]⁴³).**

These design ideas have the potential to be implemented, with specific functionality emphasized. Real-life examples of bio-inspired architectural designs include “The Hive”, a building on the NUT campus designed by Heatherwick Studio. Its honeycomb structure mirrors the cellular organization of hives, providing a modified modular hexagonal form that improves structural efficiency, particularly by optimizing airflow and ventilation systems. Another compelling example is “Little Island” in New York, which is designed to resemble floating leaves and create a dynamic urban landscape. The design shows resilience to climate change, offering flexibility in adapting to changes in water levels.

As illustrated in these examples, bio-inspired designs in architecture go beyond mere esthetics by incorporating natural elements into urban facilities, often enhancing functionality. Specifically, as shown in Fig. 20, these designs have the potential to improve sustainability and energy efficiency by optimizing material usage and structural topology compared to conventional designs. For instance, the designs in row three reflect design simplicity and potential material saving, incorporating non-uniform cellular structures tailored for different space usages. The structural design also satisfies the load path analysis, with larger column sizes at lower floors compared to the top ones, which also shows the load-bearing capacity enhancement. Moreover, energy efficiency can be improved by optimizing heating and cooling systems, as seen in designs that mimic leaf veins, such as the first design in row one. The second design in row five demonstrates a bio-inspired roof design, with beam thickness inspired by leaf veins to maximize lighting and energy efficiency. Additionally, these designs enhance the user experience by integrating natural elements into urban facilities, such as bridges and pathways within the designs. However, to actualize these designs in real-life applications, further research is needed, which includes studying wind effects and façade analysis considering the integration of extended plants and ensuring the loading and structural integrity, as well as selecting materials that are compatible with greenery.

Furthermore, bio-inspired structures featuring cellular patterns have the potential to be further enhanced with topology optimization for various purposes, such as optimizing material usage while maintaining structural integrity and architectural features. We note that these bio-inspired designs reveal innovative design approaches for environmental integration and sustainability improvements, offering a promising start for exploring the vast design space, although more research is needed to fully validate the application in the real world.

In our second example, we prompt the lamm-mit/mistral-7B-v0.3-Base-CPT-SFT-DPO model as follows (Fig. 21 shows the entire conversation):

**Fig. 21: Conversation generated by the model.**

A collection of images produced in this way is shown in Fig. 22. The images showcase close-up views of a novel form of a biological material, highlighting their intricate vein patterns and microstructures. The leaves exhibit a variety of geometric vein arrangements, ranging from polygonal to radial patterns, all sharply contrasted against the green leaf surfaces by bright yellow to gold veins. The diversity in leaf shapes and vein structures offers a variety of structural options, including a prominent feature of spider-web-like motifs. Several of the leaf patterns in the image resemble spider webs or spider-web-like structures. The veins in these particular leaves form radial patterns that converge toward a central point, much like the structure of a 2D orb spider web. This similarity is especially pronounced in some of the leaves where the vein network is more symmetrical and evenly spaced, creating a web-like appearance. These spider-web-like patterns, some of which resemble projections of various 3D webs such as cobwebs and sheet webs, add an interesting visual for studies related to natural design and biomimicry. A close inspection of the resulting images shows that textures and depth are captured in fine detail. We find that the soft yet focused lighting accentuates these patterns.

**Fig. 22: Image generation results developed by lamm-mit/mistral-7B-v0.3-Base-CPT-SFT-DPO, leading to this prompt used for generation (through FLUX.1 [dev] AI⁴³).**

Figure 23 shows a few additional sample images specifically prompting the model to develop urban design ideas based on a set of biological materials, including spider silk, collagen, and leaves, developed by lamm-mit/SmolLM-Base-1.7B-CPT-SFT-DPO. The images presented illustrate a conceptual approach to urban design that synthesizes advanced architectural techniques with principles of ecological integration and sustainability. The structures exhibit biomimetic design, characterized by their spiraling, organic forms that mimic natural patterns, possibly influenced by leaf microstructures. This design approach aligns with the principles of biophilic architecture⁴⁴, which aims to reconnect urban environments with the natural world by incorporating natural elements into the built environment.

Fig. 23: Sample images specifically prompting the model to develop urban design ideas based on a set of biological materials, specifically spider silk, collagen and leaves, developed using lamm-mit/SmolLM-Base-1.7B-CPT-SFT-DPO.

Several similar architectural designs have been realized, validating the potential for these generated concepts. Examples include Azabudai Hills, a district in Tokyo featuring curving planted rooftops; CapitaSpring, a skyscraper in Singapore with orthogonal strips of plants embedded in its façade; and the Vertical Forest, designed by Stefano Boeri Architetti, which integrates residential buildings with diverse greenery. However, there remains significant potential to explore and expand these ideas within extended urban systems, for instance, how to interconnect individual structures, ensuring seamless integration with other facilities and alignment with an overall urban planning strategy.

The integration of nature into architecture has roots in movements such as organic architecture, as advocated by Frank Lloyd Wright⁴⁵, where the harmony between human habitation and the natural world is paramount. The designs proposed here, however, push this concept further by embedding extensive greenery directly into the architectural framework, creating vertical gardens and green terraces that are integral to the building’s structure rather than ancillary elements. Moreover, the buildings are interconnected through elevated walkways, which not only facilitate human movement but also promote ecological connectivity, potentially serving as corridors for urban wildlife and contributing to biodiversity. This interconnectedness suggests a systems-thinking approach to urban design, where the built environment is considered part of a larger ecological network rather than an isolated entity.

The design concepts represented in these images could perhaps be seen as a potential paradigm shift in urban planning, where we move beyond sustainability to focus on regenerative design. This approach aims to create urban environments that not only minimize ecological impact but actively restore and enhance the natural environment. Such a model could potentially represent a significant advancement in urban ecology, proposing a future where cities operate as living systems, integrated with and supportive of their surrounding ecosystems. More work would be necessary to explore this, but this example illustrates a use case where the methods developed here can guide creative research and technology developments.