
How to Perform Load Testing on an Ollama Service Using LoadRunner™ Solutions


If you want to load test a large language model served by Ollama, LoadRunner solutions can help you.

How to Access the Ollama Service

The Ollama service is accessed via HTTP/HTTPS, a protocol that LoadRunner solutions support well. Suppose you have deployed the Ollama service with the llama2 model on your local host and talk to it with a simple question:

curl http://localhost:11434/api/generate -d '{
     "model": "llama2",
     "prompt": "hi?",
     "stream": false
   }'

Response:

{"model":"llama2","created_at":"2024-07-09T08:33:58.4645772Z","response":"Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?","done":true,"context":[518,25580,29962,3532,14816,29903,29958,5299,829,14816,29903,6778,13,13,2918,29973,518,29914,25580,29962,13,10994,29991,739,29915,29879,7575,304,5870,366,29889,1317,727,1554,306,508,1371,366,411,470,723,366,763,304,13563,29973],"total_duration":3729212400,"load_duration":3926700,"prompt_eval_count":7,"prompt_eval_duration":758750000,"eval_count":26,"eval_duration":2965213000}

Note:

  • "stream": false: the response is returned as a single response object, rather than a stream of objects.
  • eval_count: number of tokens in the response.
  • eval_duration: time in nanoseconds spent generating the response (together, these two fields give the generation speed, as worked out below).
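
For the sample response above, the generation speed works out to eval_count / (eval_duration / 1e9) = 26 / (2965213000 / 1e9) ≈ 8.8 tokens/s. The LoadRunner script below computes and reports exactly this value on every iteration.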

Do a Load Test with Your LoadRunner Solution

Now, let's take a look at how to simulate the talk and measure Ollama's speed in the LoadRunner script Action.c:

// LoadRunner's C interpreter treats undeclared functions as returning int,
// so declare atof explicitly before using it
double atof(const char *string);

// Tokens per second: token count divided by the duration in seconds
float calcSpeed(int eval_count, float eval_duration)
{
    return eval_count / eval_duration;
}

Action()
{
    char* str_count = NULL;
    char* str_duration = NULL;
    int icount = 0;
    double d_duration = 0;           // durations are in nanoseconds and can exceed the int range
    float response_eval_speed = 0;
    int i_prompt_count = 0;
    double d_prompt_duration = 0;
    float prompt_eval_speed = 0;
    float model_load_time = 0.0;
    double d_load_duration = 0;

    web_reg_save_param_json(
        "ParamName=eval_duration",
        "QueryString=$..eval_duration",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);
    web_reg_save_param_json(
        "ParamName=eval_count",
        "QueryString=$..eval_count",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);

    web_reg_save_param_json(
        "ParamName=prompt_eval_count",
        "QueryString=$..prompt_eval_count",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);

    web_reg_save_param_json(
        "ParamName=prompt_eval_duration",
        "QueryString=$..prompt_eval_duration",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);
    
    web_reg_save_param_json(
        "ParamName=load_duration",
        "QueryString=$..load_duration",
        "NotFound=warning",
        "SelectAll=Yes",
        SEARCH_FILTERS,
        "Scope=BODY",
        LAST);
    
    web_custom_request("GenerateText",
        "URL=http://localhost:11434/api/generate",
        "Method=POST",
        "Resource=0",
        "Body={ \"model\": \"llama2\", \"stream\": false, \"prompt\": \"how are you?\" }",
        LAST);

    // response evaluation speed (tokens per second)
    str_count = lr_eval_string("{eval_count_1}");
    str_duration = lr_eval_string("{eval_duration_1}");
    icount = atoi(str_count);
    d_duration = atof(str_duration);  // atoi would overflow: the sample value 2965213000 ns is above INT_MAX
    response_eval_speed = calcSpeed(icount, (float)(d_duration / 1000000000.0));

    // prompt evaluation speed (tokens per second)
    i_prompt_count = atoi(lr_eval_string("{prompt_eval_count_1}"));
    d_prompt_duration = atof(lr_eval_string("{prompt_eval_duration_1}"));
    prompt_eval_speed = calcSpeed(i_prompt_count, (float)(d_prompt_duration / 1000000000.0));

    // model loading time in seconds
    d_load_duration = atof(lr_eval_string("{load_duration_1}"));
    model_load_time = (float)(d_load_duration / 1000000000.0);

    lr_user_data_point("response eval speed(token/s)", response_eval_speed);
    lr_user_data_point("prompt eval speed(token/s)", prompt_eval_speed);
    lr_user_data_point("model loading time(s)", model_load_time);

    return 0;
}
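
The Controller can report each Action iteration as a transaction, but if you prefer to measure the talk as an explicit transaction, here is a minimal sketch using the standard lr_start_transaction/lr_end_transaction calls (the transaction name "OllamaTalk" is just an illustrative choice):

    lr_start_transaction("OllamaTalk");

    web_custom_request("GenerateText",
        "URL=http://localhost:11434/api/generate",
        "Method=POST",
        "Resource=0",
        "Body={ \"model\": \"llama2\", \"stream\": false, \"prompt\": \"how are you?\" }",
        LAST);

    lr_end_transaction("OllamaTalk", LR_AUTO);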

Note:

  • web_reg_save_param_json: parses the response and saves values from it into parameters (you can print the captured values while debugging, as shown below).
  • lr_user_data_point: reports metric values so they appear in the graphs.
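
While building the script, a quick way to verify what web_reg_save_param_json captured is to log the parameter values with lr_output_message, for example:

    lr_output_message("eval_count=%s, eval_duration=%s ns",
        lr_eval_string("{eval_count_1}"),
        lr_eval_string("{eval_duration_1}"));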

Run the Script and Analyze the Metrics

Run the script in the Controller as a scenario. You can see the metrics in the graphs:

Note:

  • Trans Response Time: since we defined one conversation in Action(), the time of each transaction is the time spent in the talk.
  • Throughput: the throughput of the Ollama service.
  • User Defined Data Points: shows the data reported from the script. Duplicate the graph to show time and speed separately.
