If you want to load test a large language model served by Ollama, LoadRunner solutions can help.
How To Access Ollama Service
The Ollama service is accessed over HTTP/HTTPS, which LoadRunner solutions support well. Suppose you have deployed the Ollama service with the llama2 model on your local host and send it a simple question:
curl http://localhost:11434/api/generate -d '{ "model": "llama2", "prompt": "hi?", "stream": false }'
Response:
{"model":"llama2","created_at":"2024-07-09T08:33:58.4645772Z","response":"Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?","done":true,"context":[518,25580,29962,3532,14816,29903,29958,5299,829,14816,29903,6778,13,13,2918,29973,518,29914,25580,29962,13,10994,29991,739,29915,29879,7575,304,5870,366,29889,1317,727,1554,306,508,1371,366,411,470,723,366,763,304,13563,29973],"total_duration":3729212400,"load_duration":3926700,"prompt_eval_count":7,"prompt_eval_duration":758750000,"eval_count":26,"eval_duration":2965213000}
Note:
- "stream": false: the response is returned as a single response object, rather than a stream of objects.
- eval_count: number of tokens in the response.
- eval_duration: time in nanoseconds spent generating the response.
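The two fields above are enough to derive a generation speed in tokens per second. The following is a minimal sketch in plain C, not LoadRunner code; the function names ns_text_to_seconds and tokens_per_second are hypothetical and chosen only for this illustration. It assumes the duration arrives as decimal text and parses it as a floating-point value, because nanosecond durations above roughly 2.1 seconds do not fit in a 32-bit int.

```c
#include <stdlib.h>

/* Convert a nanosecond duration, given as decimal text, to seconds.
   strtod is used so that large nanosecond values are not truncated
   the way they would be in a 32-bit int. */
double ns_text_to_seconds(const char *ns_text)
{
    return strtod(ns_text, NULL) / 1e9;
}

/* Tokens per second; returns 0 when the duration is not positive,
   which avoids a division by zero. */
double tokens_per_second(int token_count, double seconds)
{
    return seconds > 0.0 ? token_count / seconds : 0.0;
}
```

With the sample response above, tokens_per_second(26, ns_text_to_seconds("2965213000")) gives about 8.77 tokens/s for the response, and the prompt fields (7 tokens, 758750000 ns) give about 9.23 tokens/s.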
Do a Load Test with your LoadRunner solution
Now let's look at how to simulate this exchange and measure Ollama's speed in a LoadRunner script. The code below goes in Action.c:
// LoadRunner's C interpreter treats undeclared functions as returning int,
// so declare atof() before using it.
double atof(const char *string);

// Returns tokens per second, or 0 if the duration (in seconds) is not positive.
float calcSpeed(int eval_count, float eval_duration)
{
    if (eval_duration <= 0)
        return 0;
    return eval_count / eval_duration;
}

Action()
{
    char* str_count = NULL;
    char* str_duration = NULL;
    int icount = 0;
    float duration = 0;            // seconds
    float response_eval_speed = 0;
    int i_prompt_count = 0;
    float prompt_duration = 0;     // seconds
    float prompt_eval_speed = 0;
    float model_load_time = 0.0;   // seconds

    // Register the JSON extractions before the request they apply to.
    web_reg_save_param_json(
        "ParamName=eval_duration",
        "QueryString=$..eval_duration",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);
    web_reg_save_param_json(
        "ParamName=eval_count",
        "QueryString=$..eval_count",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);
    web_reg_save_param_json(
        "ParamName=prompt_eval_count",
        "QueryString=$..prompt_eval_count",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);
    web_reg_save_param_json(
        "ParamName=prompt_eval_duration",
        "QueryString=$..prompt_eval_duration",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);
    web_reg_save_param_json(
        "ParamName=load_duration",
        "QueryString=$..load_duration",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);

    web_custom_request("GenerateText",
        "URL=http://localhost:11434/api/generate",
        "Method=POST",
        "Resource=0",
        "Body={ \"model\": \"llama2\", \"stream\": false, \"prompt\": \"how are you?\" }",
        LAST);

    // Response eval speed. Durations are in nanoseconds and can exceed the
    // range of a 32-bit int, so parse them with atof() and convert to
    // seconds in floating point instead of using atoi() and integer division.
    str_count = lr_eval_string("{eval_count_1}");
    str_duration = lr_eval_string("{eval_duration_1}");
    icount = atoi(str_count);
    duration = (float)(atof(str_duration) / 1000000000.0);
    response_eval_speed = calcSpeed(icount, duration);

    // Prompt eval speed
    i_prompt_count = atoi(lr_eval_string("{prompt_eval_count_1}"));
    prompt_duration = (float)(atof(lr_eval_string("{prompt_eval_duration_1}")) / 1000000000.0);
    prompt_eval_speed = calcSpeed(i_prompt_count, prompt_duration);

    // Model loading time in seconds
    model_load_time = (float)(atof(lr_eval_string("{load_duration_1}")) / 1000000000.0);

    // Report the metrics as user-defined data points
    lr_user_data_point("response eval speed(token/s)", response_eval_speed);
    lr_user_data_point("prompt eval speed(token/s)", prompt_eval_speed);
    lr_user_data_point("model loading time(s)", model_load_time);
    return 0;
}
Note:
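- web_reg_save_param_json: registers a JSON query that extracts a value from the next response and saves it to a parameter. It must be called before the request.
- lr_user_data_point: reports a user-defined metric so that it appears in the graphs.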
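To make the extraction step concrete, here is a rough sketch in plain C of what pulling one numeric field out of the response body amounts to. It is an illustration only and not how LoadRunner implements web_reg_save_param_json; the function name find_number_field is hypothetical, and the lookup is deliberately naive (no nesting, no escaped quotes, no space before the colon).

```c
#include <stdio.h>
#include <string.h>

/* Naive lookup of a numeric field such as "eval_count" in a JSON text.
   Returns 1 and stores the value in *out on success, 0 otherwise.
   The key is matched together with its opening quote, so "eval_count"
   does not match inside "prompt_eval_count". */
int find_number_field(const char *json, const char *key, double *out)
{
    char pattern[64];
    const char *pos;
    int n = snprintf(pattern, sizeof pattern, "\"%s\":", key);

    if (n < 0 || (size_t)n >= sizeof pattern)
        return 0;               /* key too long for the pattern buffer */

    pos = strstr(json, pattern);
    if (pos == NULL)
        return 0;               /* key not present */

    /* Read the number that follows the colon. */
    return sscanf(pos + n, "%lf", out) == 1;
}
```

Called on the sample response with the key "eval_count", this would yield 26, which is the kind of value the script later reads from the {eval_count_1} parameter.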
Run Script And Analyze Metrics
Run the script in Controller as a Scenario. You can see the metrics in the graphs:
Note:
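- Trans Response Time: each run of Action() performs one request/response exchange with the model, so the time of each transaction is the time spent on that exchange.
- Throughput: the amount of data per second that the virtual users received from Ollama.
- User Defined Data Points: shows the data reported from the script. Because model loading time (seconds) and the eval speeds (tokens/s) use different units, duplicate the graph to show time and speed separately.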