dragonxi.dev
#DragonXi | Development Meta
#Tile programming | Output matrix is divided into multiple tiles | Each Block is responsible for the calculation of one output tile | cuTile automatically handles memory access and thread synchronization | Each Block processes a (tm × tn) tile of output matrix C | Loop over K dimension, loading corresponding tiles of A and B one by one | Use ct.mma() to perform matrix multiply-accumulate (automatically invoking Tensor Cores) | Finally, store the accumulated results back in global memory
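The tile loop described above can be sketched on the CPU in plain Python. This is only an illustration of the blocking scheme, not cuTile code: the function name `tiled_matmul` and the tile sizes `tm`, `tn`, `tk` are assumptions made for this sketch, and each pass of the two outer loops stands in for one Block.

```python
def tiled_matmul(A, B, tm=2, tn=2, tk=2):
    """CPU emulation of tile-based matrix multiplication (hypothetical helper)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Each (i0, j0) pair plays the role of one Block owning one output tile.
    for i0 in range(0, M, tm):
        for j0 in range(0, N, tn):
            rows = range(i0, min(i0 + tm, M))
            cols = range(j0, min(j0 + tn, N))
            # Accumulator for this (tm x tn) tile of C.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            # Loop over the K dimension, one tile of A and B at a time.
            for k0 in range(0, K, tk):
                for i in rows:
                    for j in cols:
                        for k in range(k0, min(k0 + tk, K)):
                            acc[i, j] += A[i][k] * B[k][j]
            # Store the accumulated tile back to the output matrix.
            for (i, j), value in acc.items():
                C[i][j] = value
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(A, B))  # [[58.0, 64.0], [139.0, 154.0]]
```

On a GPU the innermost accumulation is what a matrix multiply-accumulate instruction would replace; here it is spelled out as scalar loops for readability.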
#Launch code | Runs on CPU | Launches kernel GPU code | Calculates how many blocks are needed | Sets tile size | Passes parameters to kernel code in GPU | Specifies which CUDA stream kernel executes on | Defines how many blocks to launch | Calls function decorated with @ct.kernel
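The "how many blocks are needed" step of the launch code is a ceiling division of the output shape by the tile shape. A minimal sketch, assuming a 2D output and hypothetical helper names `ceil_div` and `grid_for_output`; the actual cuTile launch call is not shown here because it needs the cuda-tile package and a GPU.

```python
def ceil_div(a, b):
    """Integer ceiling division, so a partial tile still gets its own block."""
    return -(-a // b)


def grid_for_output(m, n, tm, tn):
    """Blocks needed to cover an (m x n) output with (tm x tn) tiles."""
    return (ceil_div(m, tm), ceil_div(n, tn), 1)


grid = grid_for_output(1000, 512, 128, 64)
print(grid)               # (8, 8, 1)
print(grid[0] * grid[1])  # 64 blocks in total
```

Rounding up matters: with 1000 rows and 128-row tiles, seven blocks would leave the last 104 rows uncovered.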
#TileGym | Tile kernel library | Provides a rich collection of kernel tutorials and examples for tile-based GPU programming | Building efficient GPU kernels | Exploring integration into large language models such as Llama 3.1 and DeepSeek V2 | Rich collection of CUDA Tile kernel examples
#Tensor core | Specialized hardware unit within NVIDIA GPUs | Designed specifically to accelerate intensive matrix math that powers modern AI and high-performance computing
#CUDA core | Process individual numbers (scalars)
#CUDA-Tile | Tile based programming model | Designed to simplify GPU kernel development and provide portability for NVIDIA-Tensor Cores
#CUDA Toolkit | To use CUDA Tile, you need the underlying CUDA Toolkit and the NVIDIA cuTile Python package
#rog-strix-x870-f-gaming-wifi | AMD Socket AM5 for AMD Ryzen 9000 & 8000 & 7000 Series Desktop Processors | AMD-Ryzen-9000 | AMD X870 chipset | 4 x DIMM slots, max. 256GB, DDR5 | Dual channel memory architecture | Supports AMD Extended Profiles for Overclocking (EXPO) | DIMM slots 2 and 4 used | 1 x DisplayPort (connected display) | 1 x HDMI port (connected display) | 2 x USB4® (40Gbps) ports support USB | 6 x USB 10Gbps ports (5 x Type-A + 1 x USB Type-C) | 4 x USB 5Gbps ports (4 x Type-A) (connected DVD using two ports) | 1 x PCIe 5.0 x16 slot supports x16 mode (installed graphics card) | 1 x Optical S/PDIF out port | 4 x M.2 slots (installed two SSDs) | 2 x SATA 6Gb/s ports (installed SATA3 hard disk and SSD hard disk) | AMD Ryzen™ 9000 processor (installed) | 1 x Intel® 2.5Gb Ethernet (installed own LAN card) | Bluetooth® v5.4 (not used) | 2x2 Wi-Fi 7 (802.11be) (not used) | ROG SupremeFX 7.1 audio | 1 x 4-pin CPU Fan header | 1 x 4-pin CPU OPT Fan header | 1 x 4-pin AIO Pump header | 5 x 4-pin Chassis Fan headers | two 140mm fans in front, one preinstalled | three fans (1x120mm, 2x140mm) installed on top | Bottom fan installed with special cover | No fans beside, space used for external hdd and ssd sata3 disks | 1 x 24-pin Main Power connector | 2 x 8-pin +12V CPU Power connectors | Corsair RM1000x, 80 PLUS Gold ATX 3.1 power supply unit (PSU) | 1 x CPU Over voltage jumper | 1 x Front Panel Audio header (F_AUDIO) | 1 x Start button | 1 x 10-1 pin System Panel header | 1 x Thermal Sensor header | bequiet-pro-901 case | PW_SW (Power switch)
#bequiet-pro-901 | Interchangeable top covers and front panels ensure maximum airflow or virtually silent operation | Removable brackets for fans and radiators with integrated fan hub at the top and front of the case simplify installation | Silent Wings 4 PWM fans ensure virtually inaudible operation and high performance | Touch-sensitive I/O panel for state-of-the-art handling | Generous dimensions allow the use of E-ATX motherboards and radiators up to 420mm | Well-designed system for extremely user-friendly handling | The decoupled motherboard tray can be installed inverted | Removable motherboard tray makes it easier to install components
#GeForce-RTX-5060 | NVIDIA-Blackwell architecture | GB206 graphics processor | 5th Generation Tensor Cores | 4th Generation Ray-Tracing-Cores | Utilizes GDDR7 memory | DLSS 4 technology, Multi Frame Generation | DisplayPort 2.1b | PCI-Express 5.0 x8 interface | OK for CUDA Tile programming | 3x DisplayPort 2.1b (up to 80 Gbps effective speed) | 1x HDMI 2.1b (max 48 Gbps) | GPU 199 x 116 x 40 mm fits inside bequiet-pro-901 | Windows can detect and manage headless GPUs | Dedicated-GPU-Memory (8GB) | Driver version: 595.79 | Driver type: DCH | Direct3D feature level: 12_1 | CUDA cores: 3840 | Resizable bar: Yes | Graphics boost clock: 2512 MHz | Memory data rate: 28.00 Gbps | Memory interface: 128-bit | Memory bandwidth: 448.03 GB/s | Total available graphics: 24098 MB | Dedicated video memory: 8051 MB GDDR7 | System video memory: 0 MB | Shared System Memory: 15947 MB | Video BIOS Version: 98.06.39.00.68 | Bus: PCI Express X | Device ID: 10DE 2D05 41A21458 | Part Number: G152 0025 | Wide Quad Extended Graphics Array (WQXGA) | Resolution: 2560x1600 pixels | Total Pixels: ~ 4.1 million pixels | Aspect Ratio: 16:10 | NVIDIA-GeForce-RTX-5060 can handle WQXGA (2560x1600) displays | NVIDIA-GeForce-RTX-5060 ok for Mainstream Gaming | 8GB-GDDR7-VRAM provides necessary memory for higher-resolution textures | NVIDIA-GeForce-RTX-5060 448-GB/s-Bandwidth aids in processing larger data loads of WQXGA displays | DLSS-4 Support is Essential for boosting frame rates | NVIDIA-GeForce-RTX-5060 is significantly more powerful than Ryzen-9-9900X integrated Radeon graphics | Internal-graphics can light-up-display, but they are not designed for gaming at UWQHD (3440 x 1440) resolutions | ROG-STRIX-X870-F GAMING-WIFI provides necessary PCIe 5.0 support for RTX 5060 x8 interface, ensuring no bandwidth limitations
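The memory bandwidth and pixel count listed for the GeForce-RTX-5060 follow from the other figures in the entry. A small sketch of the arithmetic; the function names are made up for this example, and the tiny gap to the reported 448.03 GB/s presumably comes from how the driver reports the exact data rate.

```python
def memory_bandwidth_gb_s(data_rate_gbps, bus_width_bits):
    """Peak bandwidth: per-pin data rate times bus width, converted from bits to bytes."""
    return data_rate_gbps * bus_width_bits / 8


def total_pixels(width, height):
    """Number of pixels in a display resolution."""
    return width * height


print(memory_bandwidth_gb_s(28.0, 128))  # 448.0 (GB/s)
print(total_pixels(2560, 1600))          # 4096000, i.e. about 4.1 million
```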
#AMD-Ryzen-9-9900X | 12 cores | 24 threads (each core runs two threads, logical processors) | Virtualization: Enabled | Clock speed with a max boost of 5.6 GHz, DDR5-5600 MT/s | Integrated memory controller officially supports a maximum native speed of DDR5-5600 MT/s with only two (2) single-rank (1Rx8 or 1Rx16) or dual-rank (2Rx8) DIMMs installed | BIOS may automatically downclock memory to a stable speed (likely 3600 MHz) | Using two memory modules in recommended dual-channel configuration (DIMM_A2 and DIMM_B2) | AMD-Ryzen-9-9900X processor uses AM5 socket and is officially supported by Asus-ROG-Strix-X870-F-Gaming-WiFi motherboard | Instruction sets include AVX-512 and AVX-512 VNNI, which can accelerate AI and math-heavy simulation workloads | AMD-Ryzen-9-9900X is ok for Isaac-Sim Requirements | Improved thermals staying around 76°C–80°C under heavy load, which is beneficial for long simulation runs | Installed with Noctua-NH-D15-G2 CPU cooler
#Blackstorm-34-UWQHD | Refresh Rate 180 Hz | Response time: 1ms | Panel type: FastVA (1500R curve) | Contrast: 4000:1 | Monitor does not have a DisplayPort Output for daisy-chaining | All of its DisplayPort and HDMI ports are strictly input-only
#Samsung-ViewFinity-S6-S34C650U-34-UWQHD | Display: 34" UWQHD (3440 x 1440) VA panel with a 21:9 aspect ratio | Performance: 100 Hz refresh rate and 5 ms response time, making it suitable for both professional work and casual gaming | Visuals: Supports 1.07 billion colors and HDR10 with a 3000:1 static contrast ratio | Connectivity: Features a built-in USB-C docking station with 90W power delivery, a LAN port (Ethernet), and a KVM switch to control two computers with one keyboard/mouse | Audio: Integrated 5W stereo speakers | Ergonomics: Height-adjustable stand with tilt and swivel support
#Corsair-RM1000x-ATX-3.1 | Corsair-RM1000x-ATX-3.1 is a fully modular power supply that is highly compatible with the bequiet-pro-901 full tower case | 1000W Corsair PSU provides more than enough power for GeForce-RTX-5060 Windforce OC 8GB | Since the RM1000x is fully modular, you only attach the cables you need | PSU is installed in the bottom tunnel area using a specific bracket that you first attach to PSU and then slide into case from rear | bequiet-pro-901 case comes with a specific PSU Bracket | PSU fan should generally face downwards | Dark Base Pro 901 case has a bottom-mounted grille with a full-length, slide-out dust filter. This setup allows PSU to draw cool, fresh air from outside case | 1 x ATX 24-pin (Main Motherboard Power) | 2 x EPS 4+4-pin (CPU) | 4 x PCIe 6+2-pin (GPU) | 6 x PATA 4-pin (known as a Molex 4-pin peripheral power connector) | 12 x SATA (hdd, ssd...) | 1 x PCIe 12V-2x6-pin (12+4) Cable (new GPUs) | Corsair PSUs and their accompanying Type 4 cables are built with quality materials (like 18 AWG tinned copper wire in some cases) to handle expected loads safely
#SteelSeries-Apex-Pro-Gen-3 | On-Board Memory: 5 Custom Profiles | Processor: 32 bit ARM | OmniPoint 3.0 Adjustable HyperMagnetic Switches | Rapid Trigger: makes key active as soon as it is pressed or released | Protection Mode: protects against accidental key presses by reducing the sensitivity of selected surrounding keys | Rapid Tap: Prioritize the last pressed key in a pair without needing to release the previous key | QuickSet: scans for installed games and recommends presets for supported titles, automatically applying per-key actuation, Rapid Trigger, and Lighting settings optimized for each game | Configurations to make you excel in your favorite game come from a library of ready-made presets | Customizable key pairings for stellar counter-strafing, peeking, crouch jumping or slide canceling | 2-in-1 Action Keys: Program two different actions to the same key, such as walking with a light touch or sprinting with a deep press, crafting your own advanced combos | Glow up your dream setup with 16.8 million color options | Switch Rating: Up to 100 million keystrokes | Adjustable Actuation Points: 0.1 mm - 4.0 mm | Response Time: 0.7 ms
#Logitech-G502-X-Plus-Lightspeed | LIGHTFORCE Hybrid Switches: combine the speed and reliability of optical actuation with crisp, tactile click feel of mechanical switches | HERO 25K Sensor: Provides ultra-precise tracking up to 25,600 DPI with zero smoothing or filtering | Lightspeed Wireless Technology: 68% faster response rates compared to previous generation | Battery Life: approx. 37 hours with RGB fully active and up to 130 hours with it turned off | Customizable Design: 13 programmable controls and a reversible, removable DPI-shift button to suit different grip styles | Physical Profile: weighs 106 g and is compatible with PowerPlay wireless charging system
#Samsung-990-PRO-2TB-M.2-NVMe-SSD | Gen4 M.2 Internal Solid State Drive | Read/Write Speeds up to 7,450/6,900 MB/s | Storage Capacities from 1TB to 4TB | Nickel-coated, high-end controller delivers effective thermal control to avoid performance drops mid-project | Samsung Magician Software
#Noctua-NH-D15-G2-CPU-cooler | Next-gen dual tower CPU cooler | Dual NF-A14x25r G2 PWM fans with speed-offset | Eight heatpipes | Asymmetrical fin-stacks fine-tuned to work in tandem with new fans
#Seagate-IRONWOLF-PRO-4TB-SATAIII | 4 TB capacity | 3.5" SATA III interface | 7200 RPM | 250 MB/s data transfer speed | Hard drives hold a firm cost-per-terabyte (TB) advantage over SSDs | Hard drive performance benefits from extremely parallel access
#Kingston-FURY-Beast-DDR5 -5600 -MHz -CL40-16Gb | 2 x 16 GB (1 x 16 GB) | 5600 MHz DDR5 | CL40-40-40 | 1.1 V; 1.25 V | Intel XMP 3.0 support | On-Die ECC (ODECC) | Asus-ROG-Strix-X870-F-Gaming-WiFi motherboard primary slot not recommended by ASUS due to memory controller limitation | For a single memory module, ASUS recommends using the A2 slot of Asus-ROG-Strix-X870-F-Gaming-WiFi motherboard (the second slot away from CPU) | To achieve full 5600 MHz speed (which is an overclocked profile), you may need to enable EXPO or XMP profile in Asus-ROG-Strix-X870-F-Gaming-WiFi motherboard BIOS settings | Using slots A2 and B2 of Asus-ROG-Strix-X870-F-Gaming-WiFi motherboard (the second and fourth slots from CPU) ensures optimal dual-channel performance | Installing Kingston-FURY-Beast-DDR5 -5600 -MHz -CL40-16Gb on DIMM_A1 of Asus-ROG-Strix-X870-F-Gaming-WiFi motherboard disrupts memory channel optimization, which can lead to boot failures, instability, or significantly lower performance | Asus X870-F uses Daisy-Chain topology, meaning slots furthest from CPU (A2/B2) provide best signal integrity for high-speed DDR5 | bequiet-pro-901 front fan allows for 40 mm of clearance (up to 46.8 mm if fan is moved upward), making 34.9 mm RAM fully compatible | Of 32GB, you can expect ~28GB+ to be freely available after Windows11 | 32GB RAM is recommended as Game-assets are becoming larger and due to more complex AI-Integration | NVIDIA Isaac-Sim lists 32GB as a minimum | Installing memory in matched pairs is recommended for Asus motherboard | Windows11 info about RAM with two Kingston-FURY-Beast-DDR5 -5600 -MHz -CL40-16Gb: In use: 7.8 GB, Available: 22.3 GB, Speed: 4800 MT/s, Slots used: 2/4, Form factor: DIMM, Hardware reserved: 873 MB | 873 MB of hardware-reserved-memory is primarily RAM set aside by system-BIOS for specific hardware components, most commonly integrated graphics iGPU
#Kingston-NV3-PCIe-4.0-NVMe-1TB-M.2-SSD | Kingston NV3 PCIe 4.0 NVMe SSDs are next-gen storage solutions powered by a Gen 4x4 NVMe controller | Delivering read/write speeds of up to 6,000/5,000MB/s | They offer lower power consumption and reduced heat | Capacity: 1 TB | Read speed: 6000 MB/s | Write speed: 4000 MB/s | PCIe Gen4 ×4 NVMe | Form factor: M.2 (2280) | Write endurance: 320 TBW | Estimated lifetime: 2 000 000 hours | Size: 22 x 80 x 2.3 mm
#Asus-XG-C100C-V2-10Gb | 10GBase-T PCIe network adapter | Installed in PCIe slot of Asus-ROG-Strix-X870-F-Gaming-WiFi motherboard | XG-C100C V2 only requires an x4 connection to operate at full speed | Slot is located below all M.2 slots on motherboard, near the bottom edge of board, left side of CR2032 | Had to copy the LAN driver over from a Windows10 PC to get the new Windows11 PC online
#Samsung-870.EVO-SSD-4TB-2.5"-SATA3-SSD-hard-disk | SATA III Internal Solid State Drive | Read/Write Speeds up to 560/530 MB/s | Storage Capacity 4TB | Up to 2,400 TBW with 5-year limited warranty | Managing multiple AI environments and models is easier with 4TB | When your data and tools fit comfortably on one drive, you spend less time managing storage space and more time coding and training models
#cuTile | Python-DSL (Domain Specific Language) for writing high-performance CUDA-kernels that target TileIRA
#TileIRA | Virtual ISA for NVIDIA GPUs
#C++Build-Tools | cuTile often requires Visual Studio Build Tools to compile kernels (cl.exe)
#cuTile Python | Organized into several key modules that provide programming model, compilation tools, and core operations for tile-based GPU programming
#cuda.tile | Main user-facing module containing core operations for defining kernels | Operations: load | store | matmul | reduce | scan | permute | extract
#Math-Functions of cuda.tile | Standard math operations like sin, cos, exp, and log
#Atomic-Operations of cuda.tile | atomic_add | atomic_and
#cuda.tile.compilation | Handles kernel-export and compilation pipeline, transforming Python-code into CUDA-Tile-IR
#cuda.tile.types | Defines data types and shapes supported within tile IR, such as f32, i32, tile-container
#cuda.tile.execution | Manages execution-model including abstract-machine-representation execution-spaces and tile-grid-configurations
#CUDA-Tile-IR | MLIR-based-intermediate-representation that abstracts-hardware-details like Tensor-Cores to provide portability-across-GPU-architectures
#cuda-tile package | Provides cuda.tile module for Python use
#TileIR-Compiler | cuda-tile package depends on CUDA TileIR compiler tileiras, which is not always included in the base package
#CUDA DLLs | libnvvm.dll | ptxas.exe | in the site-packages folder
#cuda.tile runtime | Will internally call components from nvidia-cuda-nvcc package already installed to compile tiled-kernels
#NVidia-CUDA-Toolkit | Directory C://Program Files//NVIDIA GPU Computing Toolkit//CUDA//v13.2//extras//visual_studio_integration//MSBuildExtensions | CUDA 13.2.props | CUDA 13.2.targets | CUDA 13.2.Version.props | CUDA 13.2.xml | Nvda.Build.CudaTasks.v13.2.dll | CUDA_PATH | CUDA_PATH_V13_2 | CUDA_CACHE_PATH | CUDA_HOME
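To see which of the CUDA environment variables listed above are actually set on a machine, a short check like the following may help. It only reads the environment; the helper name `cuda_environment` is made up for this sketch, and which variables are set depends on the installer and on manual configuration.

```python
import os

# Variable names taken from the NVidia-CUDA-Toolkit entry above.
CUDA_ENV_VARS = ("CUDA_PATH", "CUDA_PATH_V13_2", "CUDA_CACHE_PATH", "CUDA_HOME")


def cuda_environment(env=None):
    """Return {variable: value or None} for the CUDA-related variables."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in CUDA_ENV_VARS}


for name, value in cuda_environment().items():
    print(f"{name} = {value if value is not None else '(not set)'}")
```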
#Dexterous robot | Manipulate objects with precision, adaptability, and efficiency | Dexterity involves fine motor control, coordination, ability to handle a wide range of tasks, often in unstructured environments | Key aspects of robot dexterity include grip, manipulation, tactile sensitivity, agility, and coordination | Robot dexterity is crucial in: manufacturing, healthcare, logistics | Dexterity enables automation in tasks that traditionally require human-like precision
#Agentic AI | Artificial intelligence systems with a degree of autonomy, enabling them to make decisions, take actions, and learn from experiences to achieve specific goals, often with minimal human intervention | Agentic AI systems are designed to operate independently, unlike traditional AI models that rely on predefined instructions or prompts | Reinforcement learning (RL) | Deep neural network (DNN) | Multi-agent system (MAS) | Goal-setting algorithm | Adaptive learning algorithm | Agentic agents focus on autonomy and real-time decision-making in complex scenarios | Ability to determine intent and outcome of processes | Planning and adapting to changes | Ability to self-refine and update instructions without outside intervention | Full autonomy requires creativity and ability to anticipate changing needs before they occur proactively | Agentic AI benefits Industry 4.0 facilities monitoring machinery in real time, predicting failures, scheduling maintenance, reducing downtime, and optimizing asset availability, enabling continuous process optimization, minimizing waste, and enhancing operational efficiency
#Large Language Model (LLM) | Foundational LLM: e.g. Wikipedia in all its languages fed to LLM one word at a time | LLM is trained to predict the next word most likely to appear in that context | LLM intelligence is based on its ability to predict what comes next in a sentence | LLMs are amazing artifacts, containing a model of all of language, on a scale no human could conceive or visualize | LLMs do not apply any value to information, or truthfulness of sentences and paragraphs they have learned to produce | LLMs are powerful pattern-matching machines but lack human-like understanding, common sense, or ethical reasoning | LLMs produce merely a statistically probable sequence of words based on their training | LLMs are very good at summarizing | Inappropriate use of LLMs as search engines has produced lots of unhappy results | LLM output follows path of most likely words and assembles them into sentences | Pathological liars as a source for information | Incredibly good at turning pre-existing information into words | Give them facts and let them explain or impart them
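The "predict the next word most likely to appear" idea can be illustrated with a toy bigram counter. This is a deliberately tiny stand-in, not how a real LLM is built (those use neural networks over tokens); the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the statistically most frequent follower, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # cat
print(predict_next(model, "dog"))  # None
```

The model answers "cat" only because that pairing is the most frequent in its training text, which mirrors the point above: the output is statistically probable, not checked for truth.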
#Retrieval Augmented Generation (RAG LLM) | Designed for answering queries in a specific subject, for example, how to operate a particular appliance, tool, or type of machinery | RAG system takes as much textual information about the subject as possible, such as user manuals, and pre-processes it into small chunks containing a few specific facts | When user asks question, software system identifies chunk of text which is most likely to contain answer | Question and retrieved chunk are then fed to LLM, which generates human-language answer in response to query | Enforcing factualness on LLMs
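The retrieval step of RAG can be sketched with simple word overlap. This is an assumption-heavy toy: production systems typically score chunks with embeddings rather than shared words, the appliance text is invented, and the final call to an LLM is left out, so the sketch stops at building the prompt.

```python
def tokenize(text):
    """Lowercased set of words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def best_chunk(question, chunks):
    """Pick the chunk sharing the most words with the question."""
    q = tokenize(question)
    return max(chunks, key=lambda c: len(q & tokenize(c)))


def build_prompt(question, chunks):
    """Combine the retrieved chunk and the question into one prompt string."""
    context = best_chunk(question, chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "To descale the kettle, fill it with water and vinegar and boil once.",
    "The kettle switches off automatically when the water boils.",
    "Clean the filter by rinsing it under running water.",
]
print(build_prompt("How do I clean the filter?", chunks))
```

Because the model is handed the relevant facts with the question, it only has to phrase them, which is the task the LLM entry describes it as good at.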
#Large Behavior Model (LBM) | Controlling the entire robot actions | Joint research partnership between Boston Dynamics and Toyota Research Institute | Collaboration aims to create a general-purpose humanoid assistant | Whole-body movements: walking, crouching, and lifting to complete tasks that involve sorting and packing
#Vision-language model (VLM) | Training vision models when labeled data unavailable | Techniques enabling robots to determine appropriate actions in novel situations | LLMs used as visual reasoning coordinators | Using multiple task-specific models
#GPU Architecture Blackwell | NVIDIA-SMI: 595.79 | Driver Version: 595.79 | CUDA Version: 13.2 | GPU: NVIDIA GeForce RTX 5060
#Compilers and packages | Python: 3.14.4 | Microsoft 2026 compiler: cl.exe | cuda-bindings: 13.2.0 | cuda-pathfinder: 1.5.2 | cuda-python: 13.2.0 | cuda-tile: 0.0.0a0
#Tool packages | torch: 2.11.0 | pip: 26.0.1 | numpy: 2.4.4
#NVidia packages | nvidia-cuda-crt: 13.2.78 | nvidia-cuda-nvcc: 13.2.78 | nvidia-cuda-runtime: 13.2.75 | nvidia-cuda-tileiras: 13.2.78 | nvidia-nvvm: 13.2.78
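The package versions listed above can be re-checked from inside the virtual environment with the standard library. A minimal sketch: the package list is copied from these notes, the helper name is made up, and packages that are not installed simply report `None`.

```python
from importlib import metadata

# Distribution names as they appear in the notes above.
PACKAGES = [
    "cuda-tile", "cuda-python", "cuda-bindings", "cuda-pathfinder",
    "nvidia-cuda-tileiras", "nvidia-cuda-nvcc", "nvidia-nvvm",
    "torch", "numpy",
]


def installed_versions(names):
    """Return {package: version string or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```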
#Broad Area Beyond Visual Line of Sight (BVLOS) Self-Assessment Trial | Groundbreaking self-assessment capability gained | Faster Drone-in-a-Box deployments | The first commercial operator approved under CASA new self-assessment trial | Capability to self-assess and commence operations within days | Enterprises operating across multiple sites - particularly in mining, agriculture, and industrial sectors - new projects can start faster, site expansions supported immediately, autonomous drone benefits realised quicker | RocketDNA operated BVLOS missions from Remote Operations Centre (ROC) since 2022 | Developed robust safety management systems | Refined operational procedures through thousands of flights | Built deep expertise in risk assessment and mitigation | Established proven track records with the regulator | Operating safely around active industrial sites and infrastructure | Navigating overlapping safety frameworks (aviation, mining, workplace safety) | Working seamlessly with existing site operations and security protocols | Meeting the uptime and performance standards enterprise customers demand | Deployment and installation at client site | Remote operations from certified Operations Centre | Maintenance and compliance for all equipment | Data processing and delivery in preferred formats | Drone capability can move with operational priorities, not lag weeks behind them | Ability to quickly deploy drone to new locations | Running comparative trials across different sites | Autonomous drones can handle mine stock pile flights automatically on a set schedule | Volume calculations and 3D models are ready within hours | Automated flight patterns give consistent image overlap and quality. 
| Better data for planning | Fewer surprises at end of month | Docked drone systems sit in weatherproof unit that houses drone and its charging and communications systems | Base station arrives on site pre-mounted on a skid to make installation simple | Off-grid power options like hybrid solar, battery, and diesel | When scheduled or triggered manually, roof slides open, drone takes off, flies a pre-set route, and returns to recharge | Data uploads automatically, streaming live to team via Starlink, LTE/5G, or site network | SurveyBot starts from around $50–60k | Rio Tinto Gudai-Darri mine designed around autonomous haulage and remote operating centres | Mine planning, drill and blast, environmental, and geotechnical teams all needed regular surveys and imagery to support their work | Autonomous haulage fleet is running continuously | Reducing the number of manual vehicle movements into the pit | Giving multiple departments faster access to survey imagery and data | Freeing surveyors from routine flying | With more than 190 square kilometres of mine to cover | Automating routine missions with docked drone system, but retaining manual flights for specialised jobs | Satellite connections with Starlink meant system could be brought online with a simple antenna
#Robotics development platform | Autonomous mobile robots (AMRs) | Robot arms | Manipulators | Humanoids | Simulation | Robot learning frameworks | GPU accelerated libraries | AI models | Reference workflows
#Extended Robust Aerial Autonomy
#cuda-tile | python -m pip install cuda-tile==1.0.0rc6 | CUDA Tile Compiler | cuTile Python is a programming language for NVIDIA GPUs | C++17-capable compiler (GNU C++ or MSVC) | CMake 3.18+ | GNU Make (Linux) or msbuild (Windows) | Python 3.10+ with virtual environment | CUDA Toolkit 13.1+ | python -m pip install -r test/requirements.txt | pytest test/test_copy.py
#cutile-python | cuTile Python depends on CUDA TileIR compiler tileiras, which further depends on ptxas and libnvvm from CUDA Toolkit | With CUDA Toolkit (13.1+) installed, you can install cuTile Python as a standalone package | cuTile automatically searches for tileiras from location of CUDA Toolkit | python -m pip install cupy-cuda13x
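Before installing, it can be useful to check whether tileiras and its dependencies are visible at all. The sketch below looks on PATH and then in the bin folder under CUDA_PATH; this is an assumed, simplified search for illustration and not a reproduction of how cuTile itself locates tileiras.

```python
import os
import shutil

# Tool names from the notes above; libnvvm is a library, so it is not searched here.
TOOLS = ("tileiras", "ptxas", "nvcc")


def locate_tools(tools=TOOLS):
    """Return {tool: full path or None}, checking PATH and then CUDA_PATH/bin."""
    cuda_path = os.environ.get("CUDA_PATH")
    cuda_bin = os.path.join(cuda_path, "bin") if cuda_path else None
    found = {}
    for tool in tools:
        path = shutil.which(tool)
        if path is None and cuda_bin:
            # Fall back to the toolkit's own bin directory.
            path = shutil.which(tool, path=cuda_bin)
        found[tool] = path
    return found


for tool, path in locate_tools().items():
    print(f"{tool}: {path or 'not found'}")
```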
#forum-developer-cutile | Enhanced CUDA Python profiling | Modern CUDA C++ and refreshed math libraries optimized for AI and HPC kernels
#tech-cuda-tile | cuTile Python: enhanced Array support ... | memcpy with attributes | Per-context local memory footprint reduction in Windows | Query the properties of a memory pool | CUDA Graphs polymorphic function | Compiler updates | Supporting NVIDIA Jetson Orin devices on the same CUD | Allowing GPU integrated with Jetson Thor to be partitioned into two fully isolated instances | Humanoid robotics developers can isolate safety-critical workloads (motor control and safety systems) from noncritical processing tasks | NVIDIA Nsight Python | Numba-CUDA debugging | New algorithms
#cuTile requirements | nvidia-cuda-tileiras | nvidia-cuda-nvcc | nvidia-nvvm
#CUDA-Toolkit binaries | Compiler tileiras depends on ptxas, libnvvm from system-wide CUDA-Toolkit
#Driver | NVIDIA-SMI 595.79 | Driver Version: 595.79 | CUDA Version: 13.2 | NVIDIA GeForce RTX 5060
#nvcc | C://Program Files//NVIDIA GPU Computing Toolkit//CUDA//v13.2//bin//nvcc.exe | nvcc --version nvcc NVIDIA (R) Cuda compiler driver | Copyright (c) 2005-2026 NVIDIA Corporation | Built on Mon_Mar__2_21:54:11_Pacific_Standard_Time_2026 | Cuda compilation tools, release 13.2, V13.2.51 | Build cuda_13.2.r13.2/compiler.37434383_0
#TileGym | python -m pip install tilegym[tileiras] | Installed packages | tzdata | six | shellingham | safetensors | regex | pyyaml | pyparsing | pygments | pillow | packaging | mdurl | kiwisolver | idna | hf-xet | h11 | fonttools | filelock | cycler | contourpy | colorama | certifi | annotated-doc | tqdm | python-dateutil | markdown-it-py | httpcore | click | anyio | rich | pandas | matplotlib | httpx | typer | huggingface_hub | tokenizers | transformers | tilegym
#PyTorch | Forcing update to PyTorch 2.11.0 without cache | python -m pip install --no-cache-dir torch==2.11.0 --index-url https://download.pytorch.org/whl/cu130 | Successfully installed | torch-2.10.0+cu130 | torchaudio-2.11.0+cu130 | torchvision-0.25.0+cu130
#cuTile Python tests | cuTile uses pytest-framework for testing | Tests are located in test/ directory | To run a specific test file, for example test_copy.py, use the following command: | pytest test/test_copy.py
#Tensor on GPU example | imported-torch | CUDA available | GPU Name: NVIDIA GeForce RTX 5060 | PyTorch CUDA version: 13.0 | Tensor on GPU: tensor([1.0, 2.0], device=cuda:0)
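A check along the lines of the output above could look like this. It is a sketch, not the script that produced those lines: the function name `describe_cuda` is made up, and the code degrades gracefully when PyTorch or a CUDA device is missing.

```python
import importlib.util


def describe_cuda():
    """Collect basic PyTorch/CUDA facts; safe to run without torch or a GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    info = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "pytorch_cuda_version": torch.version.cuda,
    }
    if info["cuda_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
        # Create a small tensor directly on the first GPU.
        info["tensor"] = torch.tensor([1.0, 2.0], device="cuda")
    return info


for key, value in describe_cuda().items():
    print(f"{key}: {value}")
```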
#Upgrading cuda-tile package | Including tileiras compiler directly in virtual environment Scripts folder | Python 3.14 command | python -m pip install --upgrade --force-reinstall --no-cache-dir cuda-tile[tileiras]
#TileGym install | Ensuring that all necessary runtime dependencies including tileiras-compiler are present in your environment | python -m pip install tilegym[tileiras]
#Virtual environment installation Location | Executing commands using Python interpreter within virtual environment Scripts folder | Packages will be installed to site-packages subfolder of Lib | python -m pip install --upgrade --no-cache-dir --force-reinstall "cuda-tile[tileiras]" torch torchvision --index-url pytorch.org
#Matrix multiplication | Operation that is basis for solving systems of equations | Underpins graphics, simulations, optimization, and most of machine learning | Given input matrices A (MxK) and B (KxN), each element of C (MxN) is computed by taking dot product of a row of A and a column of B
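The row-times-column definition above translates directly into a few lines of plain Python. A minimal reference sketch with made-up helper names, useful mainly for checking small cases by hand:

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def matmul(A, B):
    """C[i][j] is the dot product of row i of A and column j of B."""
    columns = list(zip(*B))  # transpose B so its columns are easy to iterate
    return [[dot(row, col) for col in columns] for row in A]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```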