Skymizer’s HTX301 PCIe AI card claims 700B model inference with 384GB memory at 240W

Skymizer has announced the HTX301, a PCIe AI accelerator card designed to run large language models locally without needing a large GPU cluster.

The Taiwan-based company says the HTX301 can run inference for 700B-parameter models on a single PCIe card. If the claims hold up in real testing, that could make the card interesting for companies that want on-premises AI without the cost, power draw, and complexity of large accelerator systems.

The HTX301 is built on Skymizer’s HyperThought platform and uses the company’s next-generation LPU IP. It is aimed at LLM inference rather than broad GPU compute, with a focus on decode acceleration, prefill and decode orchestration, low latency, and fixed infrastructure costs.

Each PCIe card uses six HTX301 chips and includes up to 384GB of memory. Skymizer is not using HBM, GDDR, or LPDDR5X here. The card uses standard LPDDR4 and LPDDR5 memory, which helps keep power and cost lower.
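The memory figure puts a rough ceiling on how large each weight can be. The sketch below is back-of-envelope arithmetic, not anything Skymizer has published. It assumes a dense 700B-parameter model held entirely in the card's 384GB, 1GB = 1e9 bytes, and a hypothetical 10 percent reserve for KV cache and activations.

```python
# Back-of-envelope check: how many bits per weight fit in the card's memory?
# Assumptions (not from Skymizer): dense model, all weights resident on the
# card, 1 GB = 1e9 bytes, and a hypothetical reserve for KV cache/activations.

def max_bits_per_weight(memory_gb, params_billion, reserve_fraction=0.0):
    """Largest average bits per weight that still fits in the usable memory."""
    usable_bytes = memory_gb * 1e9 * (1.0 - reserve_fraction)
    return usable_bytes * 8 / (params_billion * 1e9)

def fp16_footprint_gb(params_billion):
    """Memory needed for uncompressed 16-bit weights, in GB."""
    return params_billion * 2.0  # 2 bytes per weight

no_reserve = max_bits_per_weight(384, 700)
with_reserve = max_bits_per_weight(384, 700, reserve_fraction=0.10)

print(f"FP16 footprint for 700B weights: {fp16_footprint_gb(700):.0f} GB")
print(f"Max bits per weight, no reserve: {no_reserve:.2f}")
print(f"Max bits per weight, 10% reserve: {with_reserve:.2f}")
```

Under these assumptions the weights would have to average roughly 4 bits each or less. That suggests the 700B claim depends on aggressive quantization or compression rather than 16-bit weights.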

The power figure is one of the biggest claims. Skymizer says the card runs at around 240W, which is much lower than high-end PCIe AI accelerators such as AMD’s Instinct MI350P or NVIDIA’s RTX PRO 6000 Blackwell server card.

Here is a quick look at the HTX301:

Feature | Details
Product | Skymizer HTX301
Type | PCIe AI accelerator
Main use | Local LLM inference
Maximum model class | Up to 700B parameters
Chips per card | 6 HTX301 chips
Memory | Up to 384GB
Memory type | LPDDR4 and LPDDR5
Power | Around 240W
Platform | HyperThought with next-generation LPU IP
Target market | On-premises AI and enterprise inference

Skymizer also says its LPU design is efficient enough to reach 30 tokens per second with only 0.5 TOPS and 100GB/s of bandwidth. For Llama 2 7B prefill, the company claims an octa-core LPU can reach 240 tokens per second, with multi-chip scaling reaching up to 1,200 tokens per second.
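The 30 tokens per second figure can be sanity-checked against a common rule of thumb: single-stream decode is usually limited by memory bandwidth, because roughly all of the weights are read once per generated token. The sketch below applies that rule. Skymizer does not say which model the 30 tokens per second figure refers to, so the 7B model at 4 bits per weight is purely an illustrative assumption, and KV cache traffic is ignored.

```python
# Bandwidth-bound decode estimate (rule of thumb, not Skymizer's methodology).
# Assumptions: every weight is read once per token, KV cache reads are ignored,
# 1 GB = 1e9 bytes. The 7B / 4-bit model below is a hypothetical example.

def gb_read_per_token(bandwidth_gb_s, tokens_per_s):
    """Data that can be streamed per token at a given bandwidth and token rate."""
    return bandwidth_gb_s / tokens_per_s

def decode_ceiling_tokens_per_s(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on decode speed if all weights are read once per token."""
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weights_gb

implied_gb = gb_read_per_token(100, 30)
ceiling = decode_ceiling_tokens_per_s(100, 7, 4)

print(f"Data budget per token at 100GB/s and 30 tok/s: {implied_gb:.2f} GB")
print(f"Ceiling for a 7B model at 4 bits per weight: {ceiling:.1f} tok/s")
```

At 100GB/s, 30 tokens per second leaves a budget of about 3.3GB of reads per token. That is close to the 3.5GB footprint of a 7B model at 4 bits per weight, so the claim is at least plausible for a heavily compressed model of that size.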

The company is also using compression to reduce memory and bandwidth pressure. Its weight compression is claimed to outperform llama.cpp by 9 to 17.8 percent, while KV cache compression is said to keep perplexity loss low.

The main appeal is clear. Many businesses want to run AI locally for privacy, predictable latency, and control over data. A single PCIe card that can handle very large models at 240W would be much easier to deploy than a rack full of high-power GPUs.

There is still reason to be cautious. These are company claims, and the HTX301 needs independent testing before it can be judged against established GPU-based systems. It also appears focused on inference, not full-scale model training.

Skymizer plans to show the HTX301 at Computex. Until real benchmarks arrive, it is best viewed as a promising on-premises AI accelerator that could matter if its 700B model, memory, and power claims are proven in practice.
