
LLMs at the Edge: Decentralized Power and Control

Most large language model (LLM) applications are deployed in centralized cloud environments, raising concerns about latency, privacy, and energy use. This chapter examines the possibility of running LLMs in decentralized edge computing, in which computing tasks are partitioned across connected devices instead of distant servers. By applying approaches such as quantization, model compression, distributed inference, and federated learning, LLMs can work within the limited computational and memory resources of edge devices, making them suitable for practical use in real-world settings. The chapter outlines several advantages of decentralization, such as increased privacy, user control, and enhanced system robustness. In addition, it discusses how energy-efficient methods and dynamic power modes can strengthen edge systems. The conclusion re-emphasizes that edge AI is a responsible and performant path toward decentralized AI technologies that are privacy-centric, high-performing, and user-first.

1 Introduction: LLM Centralization and the Case for the Edge

Large language models (LLMs), such as GPT-3, have proved crucial in processing and generating natural language and are core to applications like translation, chatbots, and content generation. Nonetheless, LLMs typically depend on centralized cloud infrastructure, which has drawbacks. These models demand significant computational power and storage, making real-time response a potential issue and raising privacy concerns as user data is sent to distant servers.

An alternative is edge computing, where processing occurs closer to the data generation sources. In an edge-based model, LLMs are deployed at edge locations on user devices such as smartphones, IoT devices, or edge servers. This helps minimize latency and protects data privacy, as the data does not need to be transferred to the cloud. This decentralized architecture also offers greater robustness, since edge devices can continue operating when disconnected from the cloud.

A significant consideration for utilizing LLMs at the edge is that edge devices are often resource-limited in terms of processing ability, storage capacity, and energy reserves. Running LLM inference on these devices is slow and therefore requires special optimization and resource-management methods.

The appeal of edge AI lies in its technical strengths in latency, privacy, and robustness, provided the issue of limited resources can be addressed. This chapter focuses on edge-based LLMs to understand how they can be deployed in practice, what issues arise, and how they might be solved (Ismail Lamaakal et al., 2025).

2 Major Issues in Edge Computing 

Using LLMs in an edge setting has challenges, mainly because edge devices are inherently limited in their capabilities. Devices such as smartphones, IoT devices, and embedded systems have far less computational power, memory, and energy than cloud servers. Standard methods for running large models like GPT-3, such as model-parallel and pipeline-based training, were designed for clusters of cloud servers rather than for such constrained hardware.

2.1 Computational Power Constraints

LLMs are resource-demanding because they analyze massive datasets and perform complicated language tasks. However, edge devices are usually outfitted with low-power processors (such as mobile CPUs or GPUs) and therefore cannot handle the computational load that LLMs require on their own. Distributed inference (a form of inference in which the load is split among devices) can help, but there is a limit to how much computation can be pushed to edge devices because of their capacity. This is especially true because many modern applications are real-time and need low latency. This issue is depicted in Figure 1. Deploying edge-based distributed inference models can go a long way toward addressing this challenge.

Figure 1: Decentralized Inference Network Architecture

This figure illustrates how the participating components and their respective energy sources are combined to carry out inference operations. Such a distributed model spreads the computational load across the various devices, reducing the burden on each one and boosting the whole system’s efficiency.

2.2 Memory Limitations

The next weak point is the memory available in edge devices. As mentioned before, the number of parameters in LLMs can be in the billions, which makes them very large. Edge devices have constrained RAM and storage resources, which prevents large models from being loaded and run. To address this, quantization and model compression help reduce the model’s size. However, these techniques may introduce a small tradeoff in accuracy.

Besides memory limitations, the energy consumed during LLM inference tasks can be modeled using Formula (1); larger models with higher memory footprints require more energy to process on edge devices.

Formula (1): Battery Energy Update

E_j(m+1) = E_j(m) + H_j(m) − C_j(m)

Where E_j(m) is the energy in the battery of node j at the beginning of stage m, H_j(m) is the energy collected from the local energy source during stage m, and C_j(m) denotes the energy consumed by node j for computation in stage m.
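As a concrete illustration, the small Python sketch below applies this update rule for a single node; the function name, the clipping to battery capacity, and the numeric values are illustrative assumptions rather than part of the original formulation.

```python
# Minimal sketch of the battery-energy update in Formula (1).
# Names and values are illustrative; the capacity clipping is an added assumption.

def update_battery(battery_j: float, harvested_j: float, consumed_j: float,
                   capacity: float) -> float:
    """Energy of node j at the start of stage m+1: previous energy,
    plus energy harvested during stage m, minus energy consumed for
    computation during stage m, clipped to the battery's capacity."""
    return min(capacity, max(0.0, battery_j + harvested_j - consumed_j))

# Example: a 100 J battery that harvests 5 J but spends 12 J on inference in stage m.
print(update_battery(battery_j=60.0, harvested_j=5.0, consumed_j=12.0, capacity=100.0))
# -> 53.0
```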

2.3 Energy Constraints

Most edge devices are portable and therefore reliant on batteries, which creates challenges around energy consumption. Executing LLM inference drains the battery, limiting how long the device can run. Techniques such as solar or kinetic energy harvesting can also power edge devices and extend their operating time. However, energy harvesting is often unreliable or inconsistent, so the power needed for inference may not always be available. Dynamic power modes address this by adjusting power consumption to the energy that is available, ensuring the devices can run longer without a power break.

Factors such as computing capability, memory, and power constrain the use of LLMs at the edge. To overcome these challenges, the models must be designed for the edge with a careful tradeoff between computation and energy requirements, new energy-sourcing methods, and distributed inference methods. The following sections discuss quantization, model compression, and the other optimization strategies necessary to perform LLM inference on edge devices.

3 Quantization and Model Compression 

Significant hurdles exist when using LLMs on edge devices, including memory requirements and computational complexity. Smartphones, IoT devices, and embedded end nodes often have restricted computational power and only a small amount of storage, which makes deploying full-scale LLMs on them directly impractical. Quantization and model compression serve as solutions that reduce the size of such models without sacrificing the capabilities that can be deployed at the edge (Chen et al., 2024).

3.1 Quantization

Quantization represents the model’s weights with fewer bits than 32-bit floating-point numbers, for example, as 8-bit integers. This saves memory and reduces the computational load in the subsequent stages of inference. The reduced precision also means computations finish faster on edge devices, which is ideal for real-time applications. The loss in accuracy is usually minor, while the gains in speed and power are generally much larger.
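For readers who want to see what this looks like in practice, here is a minimal sketch using PyTorch’s post-training dynamic quantization; the tiny feed-forward model is only a placeholder for an LLM, and the exact API and gains will vary by framework and hardware.

```python
# Sketch of post-training dynamic quantization with PyTorch.
# The small model below is a stand-in; in practice the same pattern is
# applied to the linear layers of a transformer-based LLM.
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
)

# Replace Linear layers with versions whose weights are stored as 8-bit integers.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(model_int8(x).shape)  # inference still runs, on a much smaller model
```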

3.2 Model Compression

Common model compression techniques include pruning, sparsification, and knowledge distillation. Pruning removes weights of lesser significance from the model, so fewer parameters must be computed at inference time. Sparsity likewise introduces many zero weights and thus reduces computational complexity. Knowledge distillation lets a smaller student model mimic the performance of a larger, more complex teacher model while requiring significantly less computing power.

These techniques balance the number of computations and the model’s performance, so these LLMs can be deployed on devices with limited resources without a severe reduction in accuracy.

3.3 Energy Efficiency and Job Throughput

With quantization and model compression applied, the total energy consumption of edge devices drops substantially. Figure 2a shows how job throughput and energy savings depend on the different power modes: devices that switch between dynamic power modes can complete more jobs faster while consuming less power. As the model size is decreased by quantization or compression, the energy needed to complete each inference task decreases, allowing each device to perform more tasks.

Figure 2: Power Modes Study

Figure 2a shows the number of jobs completed and the average battery level of the devices under different power modes: 15 W, 30 W, 60 W, and dynamic. Flexible power-control modes maximize job completion rates and keep energy use optimal on edge devices. Energy use drops further when models are compressed and quantized, so more tasks can be run at the edge, especially on battery-powered devices that must maintain higher battery levels to sustain continuous operation.
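The sketch below shows one way such a dynamic power policy could look; the thresholds are assumptions chosen for illustration, not values reported in the figure.

```python
# Illustrative dynamic power-mode policy inspired by the 15 W / 30 W / 60 W
# modes discussed above. All thresholds are made-up example values.
def choose_power_mode(battery_level: float, queued_jobs: int) -> int:
    """Return a power cap in watts given the battery level (0.0-1.0)
    and the number of inference jobs waiting."""
    if battery_level < 0.2:
        return 15   # conserve energy when the battery is nearly empty
    if battery_level > 0.6 and queued_jobs > 5:
        return 60   # burst through a backlog while energy is plentiful
    return 30       # balanced default

print(choose_power_mode(battery_level=0.75, queued_jobs=8))  # -> 60
```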

4 Pruning, Sparsity, and Knowledge Distillation 

Aside from quantization and general model compression, strategies such as pruning, sparsity, and knowledge distillation are also essential to making LLMs practical for edge computing. These methods simplify the model by removing small parts of it, which leads to a more efficient inference process and lower energy demands on edge devices (Dantas et al., 2024).

4.1 Pruning

Pruning removes weights or connections from the model that do not significantly influence its performance. Eliminating these redundant parts makes the model size and the number of computations needed for predictions much smaller. In edge deployment, pruning reduces size, enabling large models to run on devices with limited memory and computational capabilities. Pruning is closely related to sparsity, which introduces zeros into the model and makes the inference process less expensive. Both can accelerate the model and decrease memory consumption, which is especially valuable on edge devices.
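As a brief illustration, the following sketch uses PyTorch’s pruning utilities to zero out half of a layer’s weights by magnitude; the single linear layer stands in for the much larger layers of an LLM, and the 50% ratio is an arbitrary example.

```python
# Sketch of magnitude-based (L1) unstructured pruning with PyTorch.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 50% of weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Fold the pruning mask into the weight tensor to make it permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zero weights: {sparsity:.2f}")  # roughly 0.50
```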

4.2 Knowledge Distillation

Knowledge distillation entails transferring knowledge from a large, complicated model (the teacher) to a smaller, simpler one (the student). The student model is a lower-capacity version of the teacher model. Its goal is to mimic the behavior of the teacher model with far fewer parameters, meaning it is less resource-demanding than the original. This approach is especially beneficial when deploying LLMs to edge devices, since it lets a small model approximate the behavior of a large one on such hardware. The advantage of knowledge distillation is that the student retains much of the original model’s accuracy while being much smaller.
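A common way to implement this is a combined loss that mixes the teacher’s softened predictions with the ground-truth labels; the sketch below shows such a loss in PyTorch, with the temperature and weighting values chosen purely for illustration.

```python
# Sketch of a knowledge-distillation loss: the student matches the teacher's
# softened output distribution as well as the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL divergence between softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```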

4.3 Energy Efficiency and Model Optimization

Pruning, sparsity, and knowledge distillation cause models to consume less energy, which is necessary when using LLMs in real-world devices with limited battery life. Reducing the number of parameters involved in each prediction and lowering memory usage directly decreases the energy consumed during inference. This is especially critical in battery-operated machines, where energy conservation translates to longer usage times. These techniques complement the dynamic power modes described earlier, balancing energy and computation to yield good performance.

5 Model Partitioning and Hybrid Architectures 

Running LLMs on edge devices presents many challenges due to their restricted computational power and memory. Model partitioning and hybrid architectures remain promising approaches to dividing the computational load across devices, enabling edge systems to meet the high demands of LLMs while delivering low-latency, privacy-preserving performance.

5.1 Model Partitioning

Model partitioning divides a large LLM into multiple smaller sub-models that various edge devices or nodes can run. By breaking down the model into separate parts, each device solves a segment of the overall problem, lessening the load on any single device. This is especially useful for resource-scarce edge devices, as the parts of the model can be computed in parallel, which saves time. As described earlier, model partitioning can mitigate some resource issues at the expense of increased difficulty in data synchronization and communication between devices.
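To make the idea concrete, here is a toy sketch in which a stack of layers is split between two partitions; in a real deployment each partition would live on a separate physical device and the intermediate activations would be sent over the network, which this sketch only notes in a comment.

```python
# Toy sketch of partitioning a model's layers across two edge devices.
import torch
import torch.nn as nn

full_model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])

# Device A holds the first half of the layers, device B the second half.
part_a = nn.Sequential(*list(full_model)[:4])
part_b = nn.Sequential(*list(full_model)[4:])

x = torch.randn(1, 256)
intermediate = part_a(x)        # computed on device A
# ...activations would be serialized and sent to device B over the network...
output = part_b(intermediate)   # computed on device B
print(output.shape)
```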

5.2 Hybrid Architectures

Hybrid architectures combine the best of cloud computing and edge computing to mitigate the limitations of edge devices. In a hybrid LLM architecture, edge devices perform the less complex part of the workload while the cloud takes on the more complex computation. This allows LLMs to be used from edge devices without requiring high computational resources on the device. Hybrid architectures also scale well: new edge nodes can be added to the network without difficulty, while cloud-side processing and model updates can readily absorb the additional load (Andriulo et al., 2024).
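One simple way to think about the edge/cloud split is a routing policy that keeps light requests local and escalates heavy ones; the sketch below is a hypothetical illustration, and the token estimate, threshold, and battery cutoff are invented for the example.

```python
# Hypothetical edge/cloud routing policy for a hybrid LLM deployment.
def route_request(prompt: str, battery_level: float,
                  edge_token_limit: int = 256) -> str:
    estimated_tokens = len(prompt.split()) * 2   # crude workload estimate
    if estimated_tokens <= edge_token_limit and battery_level > 0.3:
        return "edge"    # serve with the small on-device model
    return "cloud"       # defer the heavier computation to the cloud model

print(route_request("Summarize this short note", battery_level=0.8))  # -> edge
```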

5.3 Energy Efficiency in Hybrid Systems

Figure 2b illustrates how hybrid structures aid energy control by partitioning workloads according to their complexity. More complex computations are processed in the cloud, while simpler, lighter computations are completed on devices with limited energy sources. This strategy balances energy use and avoids overloading battery-driven edge devices.

6 Distributed Inference and Federated LLMs 

When LLMs run on edge devices in a distributed fashion, the computational burden is spread across multiple edge devices to overcome the constraints of any single device. Distributed inference and federated learning are two concepts that enable AI across multiple edge devices while addressing data privacy and power consumption.

6.1 Distributed Inference

In distributed inference, an LLM is partitioned into more manageable sub-models, and the various edge devices work together to process it. Every device takes a portion of the model’s computational load, so the overall workload is split. This makes the system easier to scale, as it takes advantage of parallel processing and can work on several steps simultaneously. However, communication overhead is a significant issue, since the devices must coordinate which computations are performed and exchange intermediate results. This communication consumes additional energy, which puts extra strain on the devices, especially those running on batteries (Piccialli et al., 2024).

The power consumption of a distributed inference system can be expressed as the sum, over all participating devices, of each device’s computation and communication energy:

E_total = Σ_i (E_comp,i + E_comm,i)

Where:

  • E_comp,i is the energy consumed by edge device i for computational tasks.
  • E_comm,i is the energy used by device i for communication tasks, such as syncing data and sharing updates with other devices.

Managing both the computational load and the communication overhead in distributed inference is what leads to efficient energy utilization.
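To make the expression concrete, the short calculation below sums per-device computation and communication energy; the device names and joule figures are invented for illustration.

```python
# Illustrative total-energy calculation for a distributed inference system.
devices = [
    {"name": "phone",   "e_comp": 4.2, "e_comm": 1.1},   # joules per inference
    {"name": "gateway", "e_comp": 6.8, "e_comm": 0.9},
    {"name": "sensor",  "e_comp": 1.3, "e_comm": 1.7},
]

e_total = sum(d["e_comp"] + d["e_comm"] for d in devices)
print(f"E_total = {e_total:.1f} J per distributed inference")
```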

6.2 Federated Learning

Federated learning extends distributed inference: devices cooperate in the training and inference processes without exchanging raw data. Each device learns a local model from its local data, and only the local model updates are transmitted between devices or to a central parameter server. This keeps data private while still allowing the overall model to improve.
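The sketch below shows the core of this idea in the style of federated averaging (FedAvg): each simulated client trains locally, and only its parameters are aggregated. The tiny linear model and random local data are placeholders, not a production federated-learning setup.

```python
# Minimal federated-averaging (FedAvg) sketch: raw data never leaves a client.
import copy
import torch
import torch.nn as nn

def local_update(global_model: nn.Module, data, targets, epochs: int = 1):
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(data), targets)
        loss.backward()
        opt.step()
    return model.state_dict()

def federated_average(state_dicts):
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg

global_model = nn.Linear(8, 1)
# Two simulated edge devices, each holding its own private data.
clients = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(2)]
updates = [local_update(global_model, x, y) for x, y in clients]
global_model.load_state_dict(federated_average(updates))
print("global model updated from local parameter updates only")
```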

Distributed inference and federated learning also help minimize energy consumption by shifting some of the load to other devices so that no single device is overloaded. However, data synchronization and model updates must be managed carefully to avoid unnecessary energy consumption from inter-device communication.

7 Decentralization for User Autonomy and Resilience 

One of the benefits of implementing LLMs at the edge is that it gives greater control to end users. Edge-based systems decentralize computational tasks, giving users more command over their data and lessening reliance on centralized cloud solutions. This is essential for private-data applications in areas like healthcare, finance, and government.

7.1 User Autonomy

With decentralized LLMs deployed on smart devices, individuals do not need to transfer their information to remote cloud servers for analysis. This not only enhances privacy but also gives users a meaningful level of control over their information. With edge-based LLMs, real-time processing and inference occur on the device, letting users receive faster and more personalized services. Data sovereignty is further enhanced as end users decide where their data is processed, whether locally or in the cloud (Yan et al., 2025).

7.2 Resilience

Edge devices also add robustness to a system by decoupling it from a centralized server or cloud network. Even if the connection to the cloud is unavailable or insecure, edge-based systems can continue working, making it possible to maintain operations. This is especially important in volatile environments that lack a reliable internet connection, where offline functionality is vital. It also eliminates a single point of failure, making the system more reliable.

Given these benefits, decentralizing LLMs toward the edge enables user control while increasing system robustness, particularly in privacy-conscious and operations-critical scenarios.

8 Conclusion: Edge AI as the Future of Responsible Intelligence

Decentralized systems have become incredibly essential as the world has become more connected and dependent on artificial intelligence. The proposed use of LLMs at the edge addresses some of the issues in privacy, latency, and the limitations of resources and power in cloud-based AI systems. Edge AI also allows for better sovereignty over data and faster response and cuts reliance on centralized servers by performing calculations nearer to the users.

The advantages of decentralized LLMs are clear: decentralization empowers users, offers robustness, and provides privacy protections while improving energy and computational efficiency. As quantization, model compression, and distributed inference mature, edge-based LLMs will become feasible for many industries, including healthcare, finance, automotive, and more.

Responsible AI requires models that are decentralized, ethical, and efficient for their users. Edge AI presents a way to realize this vision and keep AI deployments open, safe, and relevant to the user.

Bibliography

[1] K. Lazaros, D. E. Koumadorakis, A. G. Vrahatis, and Sotiris Kotsiantis, “Federated Learning: Navigating the Landscape of Collaborative Intelligence,” Electronics, vol. 13, no. 23, pp. 4744–4744, Nov. 2024, doi: https://doi.org/10.3390/electronics13234744.

[2] F. Piccialli, D. Chiaro, P. Qi, V. Bellandi, and E. Damiani, “Federated and edge learning for large language models,” Information Fusion, pp. 102840–102840, Dec. 2024, doi: https://doi.org/10.1016/j.inffus.2024.102840.

[3] Ismail Lamaakal et al., “Tiny Language Models for Automation and Control: Overview, Potential Applications, and Future Research Directions,” Sensors, vol. 25, no. 5, pp. 1318–1318, Feb. 2025, doi: https://doi.org/10.3390/s25051318.

[4] Y. Chen, C. Wu, R. Sui, and J. Zhang, “Feasibility Study of Edge Computing Empowered by Artificial Intelligence—A Quantitative Analysis Based on Large Models,” Big Data and Cognitive Computing, vol. 8, no. 8, p. 94, Aug. 2024, doi: https://doi.org/10.3390/bdcc8080094.

[5] P. V. Dantas, W. Sabino da Silva, L. C. Cordeiro, and C. B. Carvalho, “A comprehensive review of model compression techniques in machine learning,” Applied Intelligence, vol. 54, no. 22, pp. 11804–11844, Sep. 2024, doi: https://doi.org/10.1007/s10489-024-05747-w.

[6] F. C. Andriulo, M. Fiore, M. Mongiello, E. Traversa, and V. Zizzo, “Edge Computing and Cloud Computing for Internet of Things: A Review,” Informatics, vol. 11, no. 4, p. 71, Sep. 2024, doi: https://doi.org/10.3390/informatics11040071.

[7] B. Yan et al., “On protecting the data privacy of Large Language Models (LLMs) and LLM agents: A literature review,” High-Confidence Computing, p. 100300, Feb. 2025, doi: https://doi.org/10.1016/j.hcc.2025.100300.

Bhanuprakash Madupati

Technology Leader | Editor & Reviewer – BizTech Bytes

Bhanuprakash Madupati is a distinguished Technology Leader at the Minnesota Department of Corrections with expertise in enterprise systems, cloud computing, and digital transformation. A Fellow of BCS, IES, RSA, and RSS, he is a Senior Member of IEEE and a Sigma Xi Full Member. At The BizTech Bytes, he contributes as an editor, reviewer, and thought leader. Bhanu is AWS Certified and actively engages as a speaker, jury judge, and mentor.
