NVIDIA NIM

integrate.api.nvidia.com

Models: 201 models
From: --
Speed: 66 tok/s
Updated: 7/8/2026

NVIDIA NIM provides optimized AI model inference APIs for LLMs, vision, and embedding models through NVIDIA cloud infrastructure.

Latency: 8.14 s

Created At: 8/13/2025

API Endpoints

Endpoint 1
integrate.api.nvidia.com

Notes

Health checks:

Scope: the 72-hour chart and recent availability measure API connectivity only. Each bar summarizes one hour of checks.

Targets: LMSpeed tries the configured health check URL and provider status URL first, then API endpoints derived from known API hosts and recent speed-test base URLs. A website host is considered only when it looks like an API endpoint.

Probe steps: each candidate goes through DNS lookup, TCP connection, TLS handshake for HTTPS, and an HTTP HEAD request with redirects followed. Probing stops after the first reachable candidate.

Reachable criteria: every required network step must succeed. An HTTP response below 500 is treated as reachable, including 401 because it confirms that an authenticated API endpoint responded, except for statuses classified as blocked.

Blocked results: HTTP 403, 429, 521, 525, and 530, plus detected WAF or Cloudflare challenges, are shown as blocked and excluded from availability calculations because LMSpeed cannot determine whether the API itself is down.

Model availability: when a dedicated test key is configured, LMSpeed sends an authenticated GET request to a derived /models endpoint and compares returned model IDs with this provider's listed models. These per-model results appear in Models & Pricing and are not included in the provider connectivity percentage.

Timeouts: TCP connection, TLS handshake, HTTP connectivity, and model requests each use a 5-second timeout. A full run can take longer when several candidates are tried.

Frequency: a background worker checks all providers hourly by default. The schedule may be changed by the service operator, so timestamps show when checks actually ran.

Limit: automated samples are not an SLA and do not guarantee account quota, every model, every region, or successful completion requests. Check the provider's own status page before making operational decisions.
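
The reachable and blocked rules above can be read as a small classification step followed by an availability calculation. The sketch below is only an illustration of those rules as written in the notes, not LMSpeed's actual code: the function names (`classify`, `availability`, `missing_models`), the string labels, and the `challenge_detected` flag are all hypothetical, and it assumes a probe reports `None` when DNS, TCP, TLS, or the HTTP request itself fails.

```python
from typing import Iterable, Optional, Set

# Status codes the notes list as "blocked" (excluded from availability).
BLOCKED_STATUSES = {403, 429, 521, 525, 530}


def classify(status: Optional[int], challenge_detected: bool = False) -> str:
    """Map one probe result to 'reachable', 'blocked', or 'unreachable'.

    Assumption: status is None when a network step failed before any
    HTTP response arrived.
    """
    if status is None:
        return "unreachable"
    # Blocked is checked first because 521, 525, and 530 are also >= 500.
    if challenge_detected or status in BLOCKED_STATUSES:
        return "blocked"
    if status < 500:
        return "reachable"  # includes 401: the API endpoint responded
    return "unreachable"


def availability(results: Iterable[str]) -> Optional[float]:
    """Percentage of reachable checks, with blocked checks left out."""
    counted = [r for r in results if r != "blocked"]
    if not counted:
        return None  # nothing measurable in this window
    reachable = sum(1 for r in counted if r == "reachable")
    return 100.0 * reachable / len(counted)


def missing_models(listed: Iterable[str], returned: Iterable[str]) -> Set[str]:
    """Listed model IDs that a /models response did not return."""
    return set(listed) - set(returned)
```

For example, under these assumptions a window with two reachable checks, one unreachable check, and one blocked check gives `availability(...)` of about 66.7, because the blocked check is dropped from both the numerator and the denominator.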

Model	Input ($/M)	Output ($/M)	Audit	Speed	Latency
z-ai/glm-5.2			1006884100	81.1 t/s	1.05 s
google/diffusiongemma-26b-a4b-it			—	697.7 t/s	0.70 s
nvidia/nemotron-3-ultra-550b-a55b			—	79.4 t/s	1.70 s
stepfun-ai/step-3.7-flash			—	118.6 t/s	15.57 s
mistralai/mistral-medium-3.5-128b			—	46.8 t/s	10.40 s
nvidia/nemotron-3-nano-omni-30b-a3b-reasoning			—	274.5 t/s	0.66 s
google/gemma-4-31b-it			1006886100	20.4 t/s	26.82 s
meta/llama-3.1-8b-instruct			—	145.7 t/s	0.21 s
meta/llama-3.3-70b-instruct			—	28.1 t/s	5.27 s
mistralai/mistral-small-4-119b-2603			—	120.9 t/s	0.29 s
nvidia/nemotron-mini-4b-instruct			—	86.6 t/s	0.45 s
z-ai/glm-5.1			848480100	82.4 t/s	9.31 s
nvidia/nemotron-3-nano-30b-a3b			—	207.0 t/s	1.13 s
microsoft/phi-4-multimodal-instruct			—	83.2 t/s	0.39 s
deepseek-ai/deepseek-v4-pro			727280100	21.4 t/s	1.34 s
deepseek-ai/deepseek-v4-flash			1006878100	22.8 t/s	14.36 s
moonshotai/kimi-k2.6			—	42.3 t/s	32.01 s
nvidia/nemotron-3-super-120b-a12b			—	28.5 t/s	31.58 s
marin/marin-8b-instruct			—	84.2 t/s	0.44 s
qwen/qwen3-next-80b-a3b-thinking			—	106.6 t/s	9.02 s
meta/llama-4-maverick-17b-128e-instruct			—	92.8 t/s	0.27 s
institute-of-science-tokyo/llama-3.1-swallow-70b-instruct-v0.1			—	19.0 t/s	0.49 s
meta/llama-3.2-90b-vision-instruct			—	16.9 t/s	0.69 s
abacusai/dracarys-llama-3.1-70b-instruct			—	18.5 t/s	0.50 s
meta/llama-3.1-405b-instruct			—	22.7 t/s	5.87 s
qwen/qwen3-coder-480b-a35b-instruct			—	49.3 t/s	2.07 s
meta/llama3-70b-instruct			—	37.5 t/s	0.47 s
minimaxai/minimax-m2.7			—	31.3 t/s	16.79 s
z-ai/glm5			—	24.9 t/s	35.20 s
stockmark/stockmark-2-100b-instruct			—	55.3 t/s	0.74 s
nvidia/llama-3.1-nemotron-ultra-253b-v1			—	45.2 t/s	0.21 s
qwen/qwen3.5-122b-a10b			—	45.2 t/s	2.90 s
qwen/qwen3.5-397b-a17b			848480100	28.0 t/s	13.61 s
minimaxai/minimax-m2.5			—	78.8 t/s	3.19 s
stepfun-ai/step-3.5-flash			—	81.8 t/s	4.79 s
moonshotai/kimi-k2.5			—	47.9 t/s	18.53 s
z-ai/glm4.7			—	57.0 t/s	27.72 s
minimaxai/minimax-m2.1			—	86.4 t/s	2.88 s
01-ai/yi-large			—	43.7 t/s	0.22 s
ai21labs/jamba-1.5-large-instruct			—	55.6 t/s	0.29 s
deepseek-ai/deepseek-v3.1			—	41.4 t/s	2.36 s
google/gemma-2-27b-it			—	43.7 t/s	0.23 s
google/gemma-3-27b-it			—	50.8 t/s	0.47 s
meta/llama-3.1-70b-instruct			—	51.2 t/s	0.23 s
microsoft/phi-3-medium-128k-instruct			—	18.1 t/s	0.49 s
microsoft/phi-4-mini-flash-reasoning			—	74.2 t/s	0.46 s
mistralai/mistral-small-24b-instruct			—	29.7 t/s	0.49 s
mistralai/mixtral-8x22b-instruct-v0.1			—	89.7 t/s	0.22 s
moonshotai/kimi-k2-instruct			—	45.1 t/s	0.58 s
nvidia/llama-3.1-nemotron-70b-instruct			—	52.2 t/s	0.23 s
nvidia/llama-3.3-nemotron-super-49b-v1.5			—	57.5 t/s	11.11 s
deepseek-ai/deepseek-v3.2			—	17.9 t/s	4.49 s
moonshotai/kimi-k2-thinking			—	41.3 t/s	19.45 s
openai/gpt-oss-120b			—	152.8 t/s	0.95 s
openai/gpt-oss-20b			7684100100	162.7 t/s	1.12 s
qwen/qwen3-235b-a22b			—	44.5 t/s	21.25 s
deepseek-ai/deepseek-r1			—	87.6 t/s	8.96 s
deepseek-ai/deepseek-r1-distill-qwen-14b			—	41.0 t/s	0.49 s
deepseek-ai/deepseek-r1-distill-qwen-32b			—	34.0 t/s	0.59 s

Time	Model	Speed	Latency
Jul 5, 05:47 AM	z-ai/glm-5.2	24.30 tok/s	1.40s
Jul 3, 03:37 AM	z-ai/glm-5.2	137.86 tok/s	0.70s
Jul 2, 04:51 AM	deepseek-ai/deepseek-v4-flash	13.40 tok/s	57.53s
Jul 2, 04:49 AM	qwen/qwen3.5-122b-a10b	58.72 tok/s	1.67s
Jul 2, 04:42 AM	deepseek-ai/deepseek-v4-pro	21.19 tok/s	1.40s
Jul 2, 04:41 AM	mistralai/mistral-small-4-119b-2603	120.92 tok/s	0.29s
Jul 2, 04:37 AM	mistralai/mistral-medium-3.5-128b	46.84 tok/s	10.40s
Jul 2, 04:34 AM	nvidia/nemotron-3-ultra-550b-a55b	36.46 tok/s	2.98s
Jul 2, 04:33 AM	moonshotai/kimi-k2.6	68.46 tok/s	10.97s
Jul 2, 04:32 AM	google/diffusiongemma-26b-a4b-it	697.68 tok/s	0.70s


Similar API Provider Alternatives to Compare

Provider	Why compare	Models	Avg price	Speed	30d uptime
NVIDIA NIM nvidia-nim NVIDIA NIM provides optimized AI model inference APIs for LLMs, vision, and embedding models through NVIDIA cloud infrastructure.	Current provider baseline	51	N/A	66 tok/s	66.7%
OpenRouter openrouter A unified API interface providing access to over 300 models from 60+ providers, including OpenAI, Anthropic, and Google.	Faster measured speed Broader model coverage Same provider category	191	N/A	86 tok/s	66.7%
siliconflow Provides cost-effective generative AI cloud services based on open-source models for text, image, video, and audio generation.	Broader model coverage Same provider category	60	N/A	53 tok/s	66.7%
api-kriora-com Provides OpenAI-compatible APIs and managed GPU instances for deploying and scaling open-source AI models.	Faster measured speed Same provider category	9	N/A	565 tok/s	66.7%
router-huggingface-co Hugging Face Router provides intelligent model routing across Inference Providers, offering OpenAI-compatible API access to open-source models.	Faster measured speed Same provider category	7	N/A	138 tok/s	40.7%
api-fireworks-ai Fireworks AI provides a cloud platform for running and fine-tuning open-source AI models with optimized inference for production applications.	Faster measured speed Same provider category	5	N/A	139 tok/s	66.7%
550c-cloud 共绩算力 (550c.cloud) is a shared computing platform providing Ollama-hosted open-source AI model inference via OpenAI-compatible API.	Faster measured speed Same provider category	4	N/A	112 tok/s	63%