Preparing Spreadsheet Data for AI Fine-tuning

Fine-tuning offers many benefits over prompt engineering when it comes to building with text-generative AI. You cut down on tokens used, generally making outputs cheaper, and you also get higher-quality outputs: a win-win. The only issue with fine-tuning is the work of actually creating datasets. We tried to address this with our Dataset Studio, but in talking with other users we found that some prefer building out datasets in spreadsheets.

This called for us to build a new solution. Introducing Riku's CSV to JSONL converter, a helpful tool that turns your spreadsheet data into JSONL format so it can be used for fine-tuning with both AI21 and OpenAI, without a single line of code. Sounds good to us!

Why build out a dataset in a spreadsheet?

You might be wondering what you can gain from building out a dataset in a spreadsheet. It is a valid question, and the answer becomes clear when you combine this workflow with others. When building out datasets, the general consensus is: the more data, the better.

Remember, when you build out a prompt you are limited by the model's 2,048-token limit, which is approximately 8,000 characters total for your instructions, prompt, examples, output examples, and then the output that you want the AI to generate. That space gets used up pretty quickly.

With fine-tuning, each of your examples can be up to this 2,048-token limit, so you can include far more context and data than is possible otherwise. I always recommend trying a fine-tune with 100 examples, then benchmarking against 200, then 500, and beyond if you have an extensive dataset. Pick a number, run the fine-tune, check the performance, and see if you get what you need.

By using a spreadsheet, you can speed up the dataset creation process. Check out Riku's Google Sheets integration, which lets you use AI inside Google Sheets so you can effectively bulk-produce your dataset and then quality-check it afterward before running your fine-tune.

How to structure your spreadsheet for conversion

This is very simple to do. Make a new sheet and put two headers in the first row: one is "Prompt", the other is "Completion". Put your data below these, with prompts in the first column and completions in the second.
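If you prefer to generate the sheet programmatically, the layout above can be sketched with Python's standard csv module (the example rows here are purely illustrative):

```python
import csv
import io

# Build a two-column sheet in memory with the "Prompt" and "Completion"
# headers in the first row, and one example per row beneath them.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Prompt", "Completion"])
writer.writerow(["Summarize: The cat sat on the mat.", "A cat sat on a mat."])
writer.writerow(["Summarize: Rain fell all day in Paris.", "It rained in Paris."])

csv_text = buffer.getvalue()
```

Exporting directly from Google Sheets or Excel as CSV produces the same shape, so use whichever is more comfortable.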

Remember, the prompt is what you give to the AI to work its magic with, and the completion is what the AI gives you back. It is always good practice to put a separator at the end of your prompt. What I like to use, and what we encourage at Riku, is linebreak, ###, linebreak, which you can also write in JSON-safe format as \n###\n.
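To see how that separator ends up in JSON-safe form, here is a small sketch: json.dumps escapes the real linebreaks into the \n###\n notation automatically (the prompt and completion text are just placeholders):

```python
import json

SEPARATOR = "\n###\n"  # linebreak, ###, linebreak, as recommended above

prompt = "Translate to French: Good morning" + SEPARATOR
completion = "Bonjour"

# json.dumps escapes the real linebreaks, so the serialized line
# contains the literal characters \n###\n in JSON-safe form.
line = json.dumps({"prompt": prompt, "completion": completion})
```

Each row of your spreadsheet becomes one such JSON line in the final JSONL file.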

Once your spreadsheet is formatted correctly, export it as a CSV. You will then have a file we can work with in Riku. Open our website and log in; you will land on the Dashboard. Click Dataset Studio in the side menu.

In the row of buttons at the top, you will see CSV to JSONL converter. This is what we're interested in here!

Click CSV to JSONL converter in the top bar and you will be able to upload your CSV file. You can also give it a name to help you remember what the dataset is in the future. When you are ready, hit the green button to start the conversion process. Your dataset will appear below, and you can then delete it if you wish, or download the JSONL file.

When you hit the green button, the backend runs a script over your CSV file, going through it line by line to construct the JSONL. This can take a few minutes if your dataset is a big one, so as a rule of thumb allow about a minute per 100 entries before hitting download.
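The line-by-line conversion can be sketched in a few lines of Python; this is only a rough illustration of the idea, not the actual backend script, which may handle edge cases differently:

```python
import csv
import io
import json

def csv_to_jsonl(csv_text: str) -> str:
    """Convert two-column Prompt/Completion CSV text into JSONL,
    one JSON object per row (a simplified sketch of the conversion)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        lines.append(json.dumps({
            "prompt": row["Prompt"],
            "completion": row["Completion"],
        }))
    return "\n".join(lines)

# Illustrative two-row input in the header layout described earlier.
sample = "Prompt,Completion\nSay hi,Hello there\nSay bye,Goodbye"
jsonl = csv_to_jsonl(sample)
```

Each spreadsheet row maps to exactly one JSON object per line, which is the format both OpenAI and AI21 expect for fine-tuning files.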

When you download the dataset, you will have a JSONL file that is ready for fine-tuning, formatted perfectly and ready to use. At Riku, we want to build these tools to make each step of the process as easy as possible for you. Building datasets shouldn't be difficult, whether in spreadsheets or in the Dataset Studio. Fine-tuning shouldn't be difficult with our no-code fine-tuning solution. Deploying that fine-tuned model shouldn't be difficult with our public share links and whitelabel solution.

If you are interested in getting involved in AI but don't know where to start, we make learning, exploring, experimenting, and deploying AI as simple as possible, with the best large language models all available in a single place online. Consider signing up today at Riku.AI.