Custom LLM Training

Test hypotheses with ChatGPT, but sell enterprise Gen AI products at scale using your own LLM: trained on your proprietary data and delivering results through a custom chatbot or integrated into your apps.

  • Warranty Period
  • 20+ years in business

Value of Our Custom LLM Training

Belitsoft offers specialist-led LLM training services exclusively for our clients, based on their internal corporate data, guaranteeing the security of this data and protecting commercial information from leakage. We train and deploy AI models on-premises and fine-tune them on internal documents, policies, and workflows to deliver domain-specific, contextually relevant, and accurate answers. Your data remains in your control, stored and processed within your infrastructure, avoiding third-party services. You get a GDPR/HIPAA-compliant, self-hosted chatbot with a customizable interface.

Get your own LLM
  • Your Own Chatbot
  • Your Own LLM Model
  • Trained on Your Own Data
  • Hosted on a Server Controlled by You

How We Build Your AI Chatbots

We provide a self-hosted AI chatbot platform powered by LLMs that we train on your internal knowledge base. Featuring a ready-to-use yet fully customizable interface, it is designed for seamless integration into your workflows, ready to deploy with your branding and providing accurate, context-aware responses.

Custom Self-Hosted AI Chatbots

We take open-source models like LLaMA and train them on your internal data—documents, FAQs, knowledge bases, and industry-specific content—so that they understand your terminology and processes.

Deployed as a chatbot within your secure internal network, our solution includes an easy-to-use interface.

Context-Aware Assistance

The chatbot interprets free-form queries and instantly delivers accurate answers grounded in the data it was trained on.

We apply NLP capabilities that allow chatbots to understand user intent, context, and sentiment, so users can ask questions in their own words without rigid, predefined input formats. This eliminates the need to navigate complex systems or wait for human support.

ERP/CRM/HR Integration & Process Automation

When a user asks a question, the chatbot uses Retrieval-Augmented Generation (RAG) to dynamically decide whether to fetch real-time data (e.g., inventory levels from an ERP) or initiate a process (e.g., submit an HR ticket or handle a customer request via APIs). It is powered by vector databases for efficient data retrieval. We consolidate data from multiple systems (ERPs, CRMs, HR tools) into a single conversational interface to avoid app switching and reduce reliance on managers or support teams (a retrieval sketch follows the list below).
  • Up-to-date data pulled from business systems is presented in plain language, making it easily accessible to non-technical staff
  • Repetitive tasks are automated, including creating reports and order updates
  • Daily tasks are simplified with live guidance and reduced completion times
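
The lookup path can be illustrated with a minimal, self-contained sketch. The toy hashing "embedding" below stands in for a real embedding model and vector database; it demonstrates the retrieve-then-generate flow, not our production stack.

```python
# Self-contained sketch of the RAG lookup path: documents become vectors, and a
# query retrieves the closest ones before the LLM answers. The toy hashing
# "embedding" stands in for a real embedding model and vector database.
import math
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy bag-of-words hashing embedding; a real system would call a model."""
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[hash(word) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are most similar to the query."""
    qv = embed(query)
    return sorted(docs, key=lambda d: -sum(a * b for a, b in zip(qv, embed(d))))[:k]

docs = [
    "Current warehouse inventory is synced nightly from the ERP.",
    "HR tickets are submitted through the internal service portal.",
]
context = retrieve("How do I check inventory levels?", docs)
# The retrieved context is prepended to the user query and passed to the LLM,
# which answers in plain language; action intents instead trigger an API call.
```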

How We Protect Your Data

We offer a secure alternative to external generative AI services like GPT, which risk data leakage by sending sensitive data to external servers, where it can potentially be used to train public models. Our self-hosted setup ensures that no data—such as personally identifiable information (PII), protected health information (PHI), or financial records—leaves your environment: no third-party servers, no external APIs, and no exposure of confidential information.

For industries handling sensitive data, like healthcare or finance—where a single breach can cost millions and damage trust—or businesses that need domain-specific responses (e.g., internal IT support, supply chain queries), we integrate our chatbots into internal workflows and deliver enterprise-grade security.

We maintain privacy and security through:

  • AES encryption: safeguards stored and transmitted data, so even if someone intercepts it, they can't access it without the proper decryption keys
  • Role-based access controls: ensure only the right people can access specific chatbot functions or data, whether it is an administrator managing settings or an employee using it for daily tasks
  • Real-time monitoring: keeps an eye on performance, detects unusual activity, and sends alerts for any potential security threats

Fully isolated environments also make it easier to meet regulatory requirements like GDPR, HIPAA, and SOC 2, as all data remains on-premises without third-party involvement.
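
For illustration, here is roughly what AES-based encryption of stored chat data can look like in Python with the widely used cryptography package. This is a sketch, not our implementation: key management is deliberately omitted and would come from a vault or KMS in practice.

```python
# Illustrative only: AES-GCM encryption of stored chat data with the
# "cryptography" package. Key management details are omitted on purpose.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # in production, load from a KMS/vault
aesgcm = AESGCM(key)

def encrypt(plaintext: bytes) -> bytes:
    nonce = os.urandom(12)                  # unique nonce per message
    return nonce + aesgcm.encrypt(nonce, plaintext, None)

def decrypt(blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, None)

secret = encrypt(b"employee PII: ...")
assert decrypt(secret) == b"employee PII: ..."  # tampering raises InvalidTag
```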

Data Feeding

For precise, context-aware responses, we train self-hosted AI chatbots on your internal data—documents, FAQs, logs, and proprietary content. Using RAG-based retrieval and document uploads, the chatbot doesn't just repeat memorized answers. Instead, it searches your uploaded data in real time and generates tailored responses. Admins upload PDFs, text files, or web content via tools like Open WebUI. The system then indexes these documents, turning them into a searchable knowledge base. If you need to train the chatbot on a website or internal wiki, it can "crawl" pages or ingest CSVs to build and expand a custom repository.
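
A rough sketch of the ingestion step, under simplified assumptions: documents are split into overlapping chunks and indexed for retrieval. A production system would embed the chunks with a model and store them in a vector database; the simple keyword index below merely stands in for that store.

```python
# Sketch: split documents into overlapping chunks and index them so the
# chatbot can search the knowledge base at answer time.
from collections import defaultdict

def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Cut a document into overlapping word windows."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

index: dict[str, set[int]] = defaultdict(set)
chunks: list[str] = []

def ingest(document: str) -> None:
    for piece in chunk(document):
        chunk_id = len(chunks)
        chunks.append(piece)
        for word in set(piece.lower().split()):
            index[word].add(chunk_id)

def search(query: str) -> list[str]:
    """Return the top-3 chunks sharing the most words with the query."""
    hits: dict[int, int] = defaultdict(int)
    for word in query.lower().split():
        for chunk_id in index.get(word, ()):
            hits[chunk_id] += 1
    return [chunks[i] for i, _ in sorted(hits.items(), key=lambda kv: -kv[1])[:3]]
```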

...

Iterative fine-tuning

Our AI engineers refine chatbots through supervised training cycles, where real-world interactions and updated datasets help the AI adjust to your industry jargon, workflows, and evolving business needs without starting from scratch. Prompt engineering techniques, including few-shot learning (teaching niche tasks with 3-5 examples) and chain-of-thought prompting ("show your work" logic), structure responses for the use case at hand and help the chatbot interpret complex queries with greater contextual accuracy. A well-crafted prompt keeps the chatbot on topic and grounded in your data. For instance, whether an employee asks, “How many vacation days do I have left?” or “When can I take vacation?”, the chatbot understands the intent and delivers the same accurate response.
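
As a sketch, a prompt combining these techniques might be assembled like this. The wording, example pairs, and helper name are illustrative, not our production prompts.

```python
# Sketch of a prompt template: a grounding system instruction, a few worked
# examples (few-shot), and a "step by step" cue (chain of thought).
SYSTEM = (
    "You are the internal HR assistant. Answer only from the provided policy "
    "excerpts. If the answer is not in them, say you don't know."
)

FEW_SHOT = [
    ("How many vacation days do I have left?",
     "Check the remaining PTO balance in the policy excerpt and report it."),
    ("When can I take vacation?",
     "Same intent: report the PTO balance and the booking rules."),
]

def build_prompt(question: str, policy_excerpts: str) -> str:
    examples = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT)
    return (
        f"{SYSTEM}\n\nPolicy excerpts:\n{policy_excerpts}\n\n"
        f"{examples}\n\nQ: {question}\nLet's reason step by step.\nA:"
    )

print(build_prompt("When can I take vacation?", "Remaining PTO: 12 days."))
```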

...

Feedback Integration

To keep your chatbot accurate and reliable, we embed user feedback loops directly into its workflow. Users upvote or downvote responses via in-chat buttons. For example, if the chatbot misinterprets “How do I reset my VPN?” as a password query, a downvote triggers an alert for admins to review and correct the answer. With sentiment analysis, we identify recurring issues, and analytics dashboards let us track user satisfaction and response accuracy, flagging low-rated answers for review. Managers and admins review flagged responses, edit answers, and upload new training data—like adding synonyms for “VPN reset” to avoid future confusion. We keep refining and tracking to ensure quality and performance.
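
A minimal sketch of how such a feedback loop can be wired up, with illustrative thresholds and in-memory storage standing in for a real database:

```python
# Sketch: log votes per answer and surface low-rated answers for admin review.
from collections import defaultdict

votes: dict[str, list[int]] = defaultdict(list)   # answer_id -> [+1 / -1 votes]

def record_vote(answer_id: str, upvote: bool) -> None:
    votes[answer_id].append(1 if upvote else -1)

def flagged_for_review(min_votes: int = 5, threshold: float = -0.2) -> list[str]:
    """Answers with enough votes and a low average rating get flagged."""
    return [
        answer_id
        for answer_id, vs in votes.items()
        if len(vs) >= min_votes and sum(vs) / len(vs) < threshold
    ]
```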

...

AI Training Engineers

We help businesses deploy, train, refine, and scale AI models while maintaining full data control. Our AI training engineers and chatbot developers have hands-on experience in LLM fine-tuning and ensure models deliver accurate, context-aware responses tailored to your industry.

Belitsoft is a software development company with expertise in data engineering, NLP model training, infrastructure deployment, and security compliance. We cover every aspect of AI implementation.

Our team builds custom ETL workflows to handle structured and unstructured data. Whether it's cleaning spreadsheets, organizing customer logs, or structuring technical manuals, we process datasets at scale and make them usable for AI training. Our engineers use techniques like RAG and prompt engineering to carefully craft AI behavior so it understands niche terminology and workflows.

We have 20+ years of experience in secure, enterprise-grade deployment and ensure HIPAA-compliant data handling, robust access controls, and continuous performance monitoring. Whether deployed on your infrastructure or on cloud platforms (AWS, Azure, GCP), our solutions turn raw data into adaptable AI for your business.

For every challenge you encounter, our AI developers offer a combination of deep back-end expertise and a tailored approach.

AI Technologies and Tools We Use

AI Development

Data Processing
  • Batch: Apache Airflow, T-SQL, Argo
  • Real-time: Kafka, RabbitMQ, REST API

Databases
  • PostgreSQL, SQL Server, Redis

ML Development
  • Core: Python, NumPy, SQLAlchemy, Pandas
  • Machine learning: scikit-learn
  • Deep learning: PyTorch, TensorFlow, PyTorch Geometric
  • NLP: spaCy, NLTK, Hugging Face
  • Other: Gensim, OpenCV

Deployment & Integration
  • FastAPI, Apache Airflow, Argo, Docker, Celery, TensorFlow Lite, ONNX, TensorRT

Custom LLM Training Portfolio

Custom Chatbot Development for a Chatbot Store / PAAS for Bot-Building
Today, our chatbots are widely used and help our Client's customers deliver the best possible messaging experience to end users.
Custom AI Voice-Based Coach Development (Assessment Automation)
AI Voice-Based Coach
Our client is a company involved in software development, IT services, and technology innovation. Over six weeks, we developed an MVP that provides efficient knowledge assessment for employees by automating test creation.
Custom Training Software based on Chatbot with Coaching/Mentoring Functionality
Custom Training Software to Develop Leadership Skills in Employees
Our Client, Jeff Otis, a US entrepreneur, turned to Belitsoft to build a unique personal leadership development program. Now, we have launched an MVP of this game-changing personalized interactive web platform with coaching/mentoring functionality.
Custom Chat-Bot and SAAS Web Platform For Lead Generation
For our client, the chief executive officer of a German startup, we successfully developed a chatbot that converts website visitors into leads, along with a database application to store them.

Recommended posts

Belitsoft Blog for Entrepreneurs
LLM Training
Types of LLM Training

LLM Fine-Tuning

Fine-tuning is the further training of a general-purpose LLM that has already been pre-trained. Such an LLM has generic knowledge but does not perform well on domain-specific tasks without further training. Fine-tuning feeds an LLM domain-specific labeled examples to allow it to complete domain-specific tasks without errors and hallucinations. It is a customization of a general-purpose LLM for a particular area of expertise. Such an LLM can better recognize specific nuances, like legal jargon, medical terms, or individual user preferences.

The fine-tuning data is often split into training, validation, and test sets. The labeled dataset should be relevant to the specific task the LLM must learn to perform, be of high quality (without inconsistent data and duplicates), have sufficient quantity, and be in the form of inputs and outputs.

A pre-trained LLM usually consists of different layers (each one processes the input in its own way). For example, GPT-4 reportedly has 120 layers. When fine-tuning such an LLM, layers representing general knowledge are kept unchanged, while only the top, later layers are modified according to the task-specific data. The goal is to make the model's predictions as close as possible to the desired output (the validation dataset is used to measure this). Both automated metrics (BLEU and ROUGE scores) and human evaluations are used to get a 360° view of a model's performance.

What do machine learning engineers do in the process of LLM fine-tuning? They focus on finding:
  • the optimal learning rate (the speed at which ML algorithms adjust some of their parameters automatically), avoiding rates that are too high or too low
  • the right batch size (the number of training examples the algorithm processes before trying to improve the model), which mitigates overgeneralization (when the model ignores exceptions or variations) and overfitting (when it memorizes without understanding the underlying principles)
  • the right number of epochs (training iterations), so the model does not train for so long that it begins to overfit

They also use regularization techniques to discourage the model from giving too much importance to one or a few features (characteristics) of the data, and to encourage it to consider all features more evenly, improving performance on new, unseen data.
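
To make these hyperparameters concrete, here is a sketch of how they map onto a typical Hugging Face fine-tuning run. The model checkpoint and the two-example dataset are placeholders, not a recommended setup.

```python
# Sketch: the hyperparameters above (learning rate, batch size, epochs,
# regularization) in a typical Hugging Face fine-tuning configuration.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

checkpoint = "your-base-model"                      # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=128,
                    padding="max_length")
    enc["labels"] = enc["input_ids"].copy()         # next-token targets
    return enc

data = Dataset.from_dict({"text": ["Input: ... Output: ...",
                                   "Input: ... Output: ..."]})
splits = data.map(tokenize, batched=True).train_test_split(test_size=0.5)

args = TrainingArguments(
    output_dir="finetuned-model",
    learning_rate=2e-5,                 # the learning rate discussed above
    per_device_train_batch_size=8,      # batch size
    num_train_epochs=3,                 # number of epochs
    weight_decay=0.01,                  # one common regularization technique
)

Trainer(model=model, args=args, train_dataset=splits["train"],
        eval_dataset=splits["test"]).train()
```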
Retrieval-Augmented Training of Large Language Models

Fine-tuned LLMs may become outdated if there is a lot of dynamic data in the domain, as they are tied to the facts in their initial training datasets. They cannot acquire new information and thus may respond inaccurately. Fine-tuning also requires a lot of labeled data, and in general its overall cost may be relatively high compared to Retrieval-Augmented Generation (RAG).

RAG is an approach that improves existing language models by integrating retrieval capabilities directly into the generation process. From a user perspective, RAG may be thought of as a search engine with a built-in content writer. The RAG architecture increases the performance of an LLM by merging standard generative capabilities with retrieval mechanisms. It works as follows: searching vast external knowledge bases, finding relevant information for a given prompt, and generating new text based on this information.

Machine learning engineers often implement RAG by relying on a vector database. The knowledge base is converted into vectors to be stored in this database. When a user submits a query to the LLM, it is also converted into a vector. The retrieval component searches the vector database for similar vectors. The most similar information is combined with the user query. This forms the augmented query that is fed into the LLM so it generates an up-to-date response.

RAG prevents the problem of "best-guess" assumptions and generates factually correct and unbiased responses because it adapts to situations where information changes over time. Since it generates answers from the retrieved data, it becomes nearly impossible for it to produce fabricated responses. The source of an LLM's answer can easily be identified from the references, which are essential for quality assurance.

Chatbots with RAG capabilities efficiently retrieve relevant information from an organization's instruction manuals and technical documents (for customer support), from up-to-date medical documents and clinical guidelines (for medics), from an institution's study materials according to specific curriculum requirements (for educational institutions), and from a repository of former depositions and legal decisions (for legal professionals). RAG also improves language translation in specialized fields.

LLM Training Stages

Data Sources Preparation

The goal here is to find and prepare data that is sufficient in volume, relevant and focused directly on the target use cases, and relatively high in quality (ready to clean).

Data Cleaning

At this stage, machine learning engineers remove corrupted data from training datasets, reduce duplicate copies of the same data to a single one, and complete (when feasible) incomplete data by adding missing information. OpenRefine and a variety of proprietary data quality and cleaning tools are available for this purpose.

Data Formatting

Models recognize patterns and input-output relationships better if the training data is structured based on specified guidelines. Examples of inputs are customer questions; outputs are the support team's responses. Machine learning engineers may reformat the source data using JSON. They use custom scripts to expedite the process and manually tweak and clean up where necessary.

Adjusting Parameters

Transformer-based deep learning frameworks are used to train models, and parameters are customized for these models. Tweaking the parameters that govern how the LLM interprets data is a way to guide it toward behaving in desired ways. The AI team knows exactly which parameters to customize and which ones not to, using methods like LoRA, as well as the best way to customize them. They adjust model weights to indicate the relationship strength between data within a training set.

LLM Training Process

Machine learning engineers run code that learns from the custom data using the previously set parameters. The process may finish after either hours or weeks, depending on the size of the data.

They train the LLM in a three-stage process that includes self-supervised learning, supervised learning, and reinforcement learning. First, the model reads a lot of texts in your domain on its own. It learns how language works and starts to guess what words or sentences might come next. Then, the model is given examples by our data scientists to learn from. After this, it can follow instructions and do well on new tasks it hasn't seen before. Finally, the model's answers are graded by our staff to teach the model which answers are preferred.
Then the trained model is tested. The goal is to have an LLM that is accurate across your domain, consistent, uses natural language, performs well in real-world tasks like problem-solving, and answers factual questions without hallucinations. In the end, the LLM is integrated with the appropriate real-life application.

How Belitsoft Can Help

Data Collection and Preprocessing
We help aggregate a diverse dataset of anonymized data from various sources and manage the process to comply with privacy regulations.

Annotation and Labeling
Experienced subject matter experts annotate the dataset to ensure that the model learns from expert interpretations.

Model Architecture
We employ modern deep learning architectures, designed for specific tasks (image analysis, etc.) and tailored to the characteristics of the dataset (X-ray images, among others), to process and interpret data as efficiently and accurately as possible.

Training Process and Performance Metrics
The model is then trained using supervised learning techniques, with hyperparameter tuning, to achieve the best levels of performance metrics (accuracy, sensitivity, specificity, area under the ROC curve, etc.).

Bias Mitigation
We verify whether the training dataset is diverse (e.g., it represents a range of groups, characteristics, and conditions) and adjust training based on the results to minimize biases.

Risk Assessment
We perform risk analysis to identify failure modes and implement appropriate safeguards.

Testing and Validation Studies
First, the model undergoes validation using separate datasets that were not seen during training. Then, primary users test the tool in real-world scenarios, but within controlled settings, to provide feedback.

Integration with Existing Systems, User Training and Support
Finally, we integrate the AI tool into business workflows, bringing it live. Using APIs, including OpenAI API integration, we connect AI models with your existing software ecosystem. Our expertise covers integrating AI-powered systems with CRMs, ERPs, data lakes, and proprietary platforms to optimize performance. Training materials and support services are also provided to help users effectively utilize the tool.

Periodic Re-training, Change Management and Version Control
To retrain the model, we gather new data (either from user feedback or the latest findings). Any modifications to the model follow a structured process that includes re-validation and documentation. We also maintain strict version control, recording changes between versions to trace updates.

Regulatory Compliance and Data Security
All model updates are evaluated for regulatory impact to maintain compliance, for example, adherence to Good Machine Learning Practice guidelines. Naturally, we implement security protocols to safeguard sensitive data during transmission and storage.

Belitsoft provides full-cycle LLM training services to power generative AI applications, from custom AI chatbots to domain-specific assistants. We fine-tune large language models with proprietary datasets, apply prompt engineering techniques for optimal responses, and optimize training pipelines for efficiency. Contact us today to discuss your needs.
Dmitry Baraishuk • 6 min read
LLM Pretraining
Pre-training is the First Step in Training an LLM

Training a large model from scratch is computationally expensive, requiring multiple state-of-the-art GPUs. For this reason, most developers won't pre-train a model from scratch and will instead take an existing model and use fine-tuning to adapt it to their own tasks. However, there are still some situations where pre-training a model may be required or preferred. Some want to build models for tasks in specific domains like legal, healthcare, and e-commerce. Others need models with stronger abilities in specific languages.

Further, new training methods are making more efficient pre-training possible, like Depth Upscaling, which uses two or more copies of an existing model to build a larger one. Because of this technological improvement, there is more and more interest in pre-training. Depth Upscaling creates a new, larger model by duplicating layers of a smaller pre-trained model. The new model is then further pre-trained, resulting in a better, larger model than the original. Models created this way can be pre-trained with considerably less compute than traditional pre-training, representing a large cost saving.

Whether pre-training is the right solution for your work depends on several factors: whether an existing model might already work for your task without pre-training, what data you have available, the compute resources you have access to (both for training and serving), and lastly, your privacy requirements, which may also implicate regulatory compliance requirements.

Pre-training large models on large datasets is an expensive activity, with a minimum cost of maybe about $1,000 for the smallest models, up to tens or hundreds of thousands of dollars for a billion-parameter-scale model. So do be careful if you choose to try this out yourself. There are calculators, like one from Hugging Face, that can help you estimate the cost of your pre-training scenario before you get started. These can help you avoid unexpected large bills from your cloud provider.

Best Use Case for Pre-training

Pre-training is the first phase of training an LLM, where the model learns to generate text by using a very large amount of unstructured text data. Each text sample is turned into many input-output pairs. Over time, the model learns to correctly predict the next word, and in doing so, it acquires knowledge about the world. These base models are good at generating text, but not always good at following instructions or behaving in a safe way. The LLMs you encounter in consumer applications like ChatGPT, Bing Search, and others have had their initial pre-training extended with a phase of fine-tuning to make them better at following instructions, and alignment with human preferences to make them safe and helpful.

The model only has knowledge of the content that was in the training data, so if you want the model to learn new knowledge, you have to do more training on more data. Additional fine-tuning or alignment training is useful for teaching the model new behavior, say writing a summary in a specific style or avoiding a particular topic. However, if you want the model to develop a deep understanding of a new domain, additional pre-training on text from that specific domain is necessary. People often try to add new knowledge without pre-training, focusing on fine-tuning the model with smaller datasets.
However, this doesn't work in every situation, especially if the new knowledge is not well represented in the base model. In those cases, additional pre-training is required to get good performance.

Let's take a look at a specific example. Say you want to create an LLM that is good at a specific language. A base model that wasn't trained on much text from this language, for example the Llama 7B model, cannot write text in this language. If you ask the model to tell us about some native term, it gets the answer completely wrong. A model fine-tuned on a small amount of data can answer only partially in this specific language; however, the answer will actually make sense. A model created by further pre-training an LLM on a huge amount of unstructured text in the language of your interest can speak this language fluently. So as you can see, pre-training is critical here to getting a good language model.

How can we make the results better? Some people will think of fine-tuning. Fine-tuning involves training your model on a small amount of task-specific data. It is important to note that in contrast to fine-tuning, which can sometimes be done using a few hundred thousand tokens and can be quite cheap, pre-training requires lots of data and so is expensive. Training a 248-million-parameter model on 16 H100 GPUs may take seven hours and cost $1,500 on AWS.

LLM Data Cleaning

When pre-training a model, it is important to start with a high-quality training dataset. The datasets used for pre-training LLMs are made up of vast amounts of unstructured text. Each text sample is used to train an LLM to repeatedly predict the next word, known as autoregressive text generation. During the training phase, the model's weights are updated as it processes each example in the training data, until over time, the model becomes good at predicting the next word. You can think of this phase as being like reading, where the input texts are used in their original form without any additional structuring of the training samples. Huge amounts of training text, equivalent to millions of completed books, are required for language models to get really good at next-word prediction and to encode reliable knowledge about the world.

In contrast, the data used for fine-tuning is highly structured: question-answer pairs, instruction-response pairs, and so on. So the form of the fine-tuning sample is quite different. The goal of fine-tuning is to get the model to behave in a certain way or to get good at completing a particular task. If pre-training is like reading many, many books, you can think of fine-tuning as being like taking a practice exam. You aren't really learning new knowledge; you learned everything from your reading and pre-training. Instead, fine-tuning is just learning how to answer questions in a specific way.

If you want to read a lot of text, you have to find a lot of books, code examples, articles, Wikipedia pages, webpages, and so on. In practice, pre-training datasets are built from large collections of text documents, many of which are sourced from the internet. The world is filled with text, so it's quite easy to find lots of text for pre-training. Fine-tuning datasets, on the other hand, require precise questions and high-quality corresponding answers. Traditionally, this work has been done by humans, which takes time and can be expensive.
More recently, teams have been using LLMs to generate fine-tuning data, but you need to use a very capable model for this to work well. In fact, you need to do a bit more work to create good-quality fine-tuning datasets. Below, we compare and contrast some sample pre-training and fine-tuning datasets.

Data quality is very important for pre-training LLMs. If there are issues with your training data, for example lots of duplicate examples, spelling errors, factual inconsistencies or inaccuracies, and toxic language, then your resulting LLM will not perform well. Taking steps to address these issues and make sure that your training data is of high quality will result in a better LLM and more return on your training investment.

Here are the major tasks you should complete to clean your text data for training. The first is deduplication. Duplicated data can bias your model toward particular patterns and examples. It also increases your training time while not necessarily increasing model performance. Thus, removing duplicate text is a crucial step in cleaning your data, both within individual documents and across all documents. You also want the intrinsic quality of your training data to be high: the text should be in the language you are interested in, be relevant to any topics you want the LLM to build knowledge of, and meet any other quality metrics that you have. You can design quality filters to clean up this aspect of your training data. A related step is applying content filters to remove potentially toxic or biased content; safety is an important concern. Then, to avoid potential data leakage, you should always remove personally identifiable information (PII) from your examples. One common strategy is to redact it in the training text. Lastly, you can come up with rules for how to fix common quality issues like all caps, extra punctuation, and poorly formatted text.

As you can see, data cleaning can be complicated and takes lots of time. Luckily, more and more tools are available to help you with this important step. One example is Dataverse, an open-source project. Dataverse is a ready-to-use data cleaning pipeline that will take your raw training data, apply the cleaning steps above plus other ones, and then package up your data in a way that is ready for training. You can take a look at the GitHub page to learn more about how to use Dataverse.

Data Cleaning Steps

We start with data collection. Since the objective of pre-training is next-token prediction, you need a gigantic corpus of unlabeled data. You can often acquire this data by scraping the web, gathering documents within your organization, or simply downloading open datasets from data hubs. The content itself is not important; the important part is that each example consists of plain text data. For pre-training, this is what we want: plain text that is not structured in any kind of instruction format, for example a question-answer pair. Feel free to change the index number here if you want to explore any other example within the dataset.

Now let's download another dataset called Alpaca. Alpaca is a fine-tuning dataset which contains 52,000 instruction-following examples generated by GPT-4. Here you can see the dataset consists of an instruction, an input, and an output. Let's see what an example looks like. Here we take the first example and print the instruction, input, and output: it's three tips for staying healthy.
Note that in contrast to the pre-training dataset, which consists solely of text, this instruction dataset, Alpaca, includes instruction, input, and output columns. Since we are interested in pre-training, we will only use the pre-training dataset from now on.

Now let's try scraping the web to form a custom dataset. To do this, we will download nine random Python scripts. Note that in practice you will have many, many more samples, up to billions. This is a very practical action you will take when pre-training your own model: download some data, add some custom data, and combine them. Now we have a total of 60,009 rows.

Let's go through some typical steps of data cleaning and see how the number of rows decreases as we progress. First, we will filter out samples that are too short. This function describes a common practice for pre-training data: simply put, we keep text that has at least three lines or sentences, where each line contains at least three words. We want to do this because our objective in pre-training is to predict the next token, and short examples are not very useful for that task. So let's try running this function. Note that the datasets library has a filter method which applies a function to each example in the dataset. If you check the number of rows, you can see that over 7,000 rows got eliminated.

Now we'll move on to the second part, where we remove repetitions. This is a function where, given an input of paragraphs, you can find duplicates. We use this function to find repetitions within a paragraph, and if, compared to a paragraph's length, the paragraph has too many duplicates, we return false to get rid of it. We run this function throughout the dataset. Now we're down to 52,000 examples, which is a decrease of 30 rows. That is a tiny decrease, but this is one advantage of downloading datasets from Hugging Face: datasets on Hugging Face have a lot of the pre-processing done already.

For the third part of pre-processing, let's go on to deduplication. This function removes duplicate entries by storing unique text segments and comparing each text against them. Let's try running that function. As a result, 8,000 rows were removed, and that is a big decrease. In reality, there is also a lot of duplication across documents, so make sure you cover this step.

The last step is language filtering. This is one of the quality filters that Sung previously mentioned. If you want to focus on a particular language or domain, it is good to filter out other languages or domains so that the model is trained on relevant text. Here we'll use the FastText language classifier to keep only English samples to train our model. You will see a warning, but don't worry about it too much. Also note that this run is slower than the filters we ran above; that is because this is actually a real machine learning model in action. Let's check the number of rows: now we're down to 40,000 after removing approximately 3,000 rows. Here, I would like to note that starting from a large dataset in the first place is very important, because you are constantly throwing out rows while cleaning the dataset.

Finally, we will save the data to the local file system in Parquet format. Note that in reality you would want to save the data at each stage of cleaning, because you're handling a large amount of data and it cannot all be held in memory.
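
A condensed sketch of these cleaning steps, assuming the Hugging Face datasets library and placeholder file names; the heuristics mirror the lesson's intent rather than its exact code.

```python
# Sketch: length filtering and exact deduplication with the "datasets" library.
from datasets import load_dataset

ds = load_dataset("parquet", data_files="raw_pretraining_data.parquet")["train"]

def long_enough(example):
    """Keep samples with at least three lines of at least three words each."""
    lines = [l for l in example["text"].splitlines() if l.strip()]
    return len(lines) >= 3 and all(len(l.split()) >= 3 for l in lines)

ds = ds.filter(long_enough)                 # drop samples that are too short

seen: set[int] = set()
def unseen(example):
    key = hash(example["text"])
    if key in seen:
        return False                        # exact duplicate: drop it
    seen.add(key)
    return True

ds = ds.filter(unseen)                      # global exact deduplication
ds.to_parquet("clean_pretraining_data.parquet")
```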
Parquet is a columnar storage file format that is widely used in big data and data analytics scenarios. You're free to use any other format like CSV or JSON, but since Parquet is really fast, we're choosing it here. The next step in the process is to prepare your saved dataset for training, which involves some additional manipulation of the data.

Data Tokenizing and Packing

Now that you have your clean dataset, you need to prepare it for training. There is a bit more manipulation of the data to do before you can use it in a training run. The two main steps are tokenizing the data and then packing it. LLMs don't actually work directly with text; their internal calculations require numbers. Tokenization is the step that transforms your text data into numbers. The exact details of how text is mapped to tokens depend on the vocabulary and the tokenization algorithm of your model. Each model has a specific tokenizer, and it is important to choose the right one, or your model won't work. Packing structures the data into continuous sequences of tokens, all of the maximum length the model supports. This reshaping makes training efficient.

Let's start with tokenizing. You can choose any tokenizer from an existing model hosted on Hugging Face or create your own. Many times you will see models in the same family use the same tokenizer. In this case, we will be using TinySolar's tokenizer from Solar, which is in the same family. Now we are going to calculate the total number of tokens in our dataset. When training LLMs, we are often interested in the total number of tokens, and we can easily check this with NumPy. With this small dataset, which started out with approximately 4,000 text samples, you actually have 5 million tokens. Let's pack our dataset. We now have our clean data tokenized and packed into the right shape for training.
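
A sketch of the tokenize-and-pack step, assuming a placeholder tokenizer checkpoint and a tokenizer that defines an end-of-sequence token:

```python
# Sketch: turn text into token IDs, concatenate them into one stream, and cut
# the stream into fixed-length sequences for efficient training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-family-tokenizer")
max_len = 2048   # the maximum sequence length the model supports

def pack(texts: list[str]) -> list[list[int]]:
    stream: list[int] = []
    for text in texts:
        stream.extend(tokenizer(text)["input_ids"])
        stream.append(tokenizer.eos_token_id)   # separate documents
    # cut the continuous token stream into equal, model-sized chunks
    return [stream[i:i + max_len]
            for i in range(0, len(stream) - max_len + 1, max_len)]
```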
Model Training

Decoder-only or Autoregressive Models

Now you need a model to train. There are several ways to configure and initialize a model for training, and your choice will impact how quickly pre-training proceeds. Although there are several variations of the transformer architecture used in large language models, we're focusing on decoder-only, or autoregressive, models. The decoder-only architecture simplifies the model and is more efficient for next-token prediction. OpenAI's GPT models and most other popular LLMs, such as Llama and Mistral, have adopted a decoder-only architecture. A decoder-only model is made of an embedding layer that turns text into vector representations, then several decoder layers, each of which contains several different parts based on neural networks. Lastly, the model ends with a classifier layer that predicts the most probable next token from the vocabulary.

Initialize the Weights

Once we decide on the architecture, the next step is to initialize the weights. These weights get updated during training as the model learns to predict the next token from the examples in the training data. There are a few ways you can initialize the weights. The simplest choice is to initialize them with random values. This is okay, but it means that training takes a very long time and requires a huge amount of data. A better way is to reuse existing weights: for example, you can start from Llama 7B or Mistral 7B weights. This means your model has already been trained and has some basic knowledge, so it can generate text very well already. This is the best way to start if you want to continue pre-training a model on new domain data. Training in this scenario generally takes much less data and time than starting from random weights, but still much more data than fine-tuning. With all the open models out in the world right now, this can be a great option for creating your own custom LLM.

In one of our trainings, we used exactly the same model size but put in more data: 200 billion tokens, with hyperparameters that are also very different from fine-tuning. The total price was about 0.2 million dollars. So it's still expensive, but much, much cheaper than starting training from scratch. In another training, we used 1 trillion tokens, so the approach was more expensive: it cost us about 1 million. However, this is actually much less data than needed to train a model of this size from scratch, which would be around 3 trillion tokens.

Model Scaling

You might notice that our model has 10 billion parameters, which is not the same size as the pre-trained weights we initialized the model with. We found that the available 7-billion-parameter model was not quite good enough for our purposes, but we were limited by our hardware to training a model of less than 13 billion parameters. So we took advantage of a technique called model scaling to create a new model with a different size. Model scaling removes or adds layers to an existing model and then carries out more training to create a new model with a different size.

What if you want to make a smaller model? One option is called downscaling. Downscaling involves removing layers to produce a smaller model than the one you started with. This approach can work well for large models, but it doesn't work well for small models. In general, layers near the middle of the model are removed, and then the resulting smaller model is pre-trained on a large body of text to bring its weights back into coherence.

The better method is called upscaling. Here you start with a smaller model, then duplicate some layers to make a larger model. Let's take a look at an example. To make a 10-billion-parameter model with upscaling, you can start with a 7-billion-parameter model. For illustration, let's assume the 7B model has 4 layers (in reality, Llama 7B, for example, has 32 layers). You can make two copies of the model, then take some top layers from one copy and some bottom layers from the second copy and put them together to create a new model with 6 layers. At this point, the model is no longer coherent, and inference would not work well. Continued pre-training is required to bring the model back into coherence and enable text generation. However, because the model weights of the copied layers have already encoded some language understanding and knowledge, it takes less data and time to create a good model. In fact, upscaling can allow you to train a larger, well-performing model with 70% less data than training the equivalent model from scratch. So depth upscaling can actually be a more cost-effective way to pre-train a model, although it's still expensive.

Let's take a look at how you can create models using each of these methods. Let's begin as before by setting a configuration to minimize warnings and by setting a seed for reproducibility. The models we will be creating here will be based on the Meta Llama 2 architecture, a decoder-only model that is one of the most frequently used architectures among LLM developers.
You can set configuration options using the LlamaConfig class of the Transformers library. We will reuse most of the parameters of the original Llama 2 model, but since we want to run our model with limited computation, let's adjust some parameters to reduce the model size. We will set the number of hidden layers to 12 and shrink the model in terms of hidden size, intermediate size, and number of key-value heads. Experimenting with these settings is hard because pre-training takes so much time and is expensive. The best place to look for advice on designing a model's architecture is the academic literature, so look for papers on arXiv and in conference proceedings.

Now that we have determined our model configuration, let's initialize the model. The first and most naive way would be to initialize it with random weights. This is very easy with the Transformers library: all you need to do is pass in the config we've just defined and create an instance of the Llama model. Before we move on, let's check the size of the model. When training an LLM, we always want to be sure of the model's size, because size directly impacts compute and cost. Our current model is sized at 248 million parameters. When a model is randomly initialized, the weights are given random values. Let's take a look at a small sample of weights from one of the layers in the self-attention head. The model is randomly initialized and not trained on any data. Do you want to try it for inference? Can you guess what it will output? You've seen this happen before: we first load a tokenizer, and you will see some random outputs, because our model is not trained yet. Before we move on, let's release the memory; the models we created take up several hundred megabytes, and we need to release the memory to avoid crashing the kernel.

Now, instead of random weight initialization, let's try using a pre-existing pre-trained model. All we need to do is load the model using AutoModelForCausalLM, and we are ready to keep training. Taking an existing model and continuing to train it on new data is called continued pre-training, and it is a much faster way to train a model on new data than starting from scratch. Before we move on, let's empty the memory once more.

Earlier, we showed how you can remove layers from a large model to create a smaller one in a process called downscaling. Here's how you can do that. You will be shrinking a 12-layer, 248-million-parameter model by removing the mid-layers. To start, let's check how many layers the model currently has. You can see that the model currently has 12 layers and 248 million parameters. Now let's create a smaller model from our initial model by deleting two of the mid-layers. Here we select the first five layers and the last five layers and concatenate them to form a total of 10 layers. Now you have 10 layers left, which is what we wanted. This model configuration is ready to start using for pre-training. As you heard earlier, downscaling works best with larger models; this small model would not be a good choice and is only being used to show you the method. Let's go ahead and empty our memory once more.

So now you are going to try upscaling a pre-existing pre-trained model. By upscaling, we mean that we start from a small pre-trained model and end up with a larger model. Here we will be upscaling the 12-layer model to a model with 16 layers.
The first step is to create a model instance for the large final model we are going to train. These are the basic configurations for the larger model. As above, we start with the Llama 2 model architecture, and all numbers other than the number of hidden layers are the same as in the smaller pre-trained model we are going to upscale. Let's finish this part up by initializing the larger model with random weights.

Next, you are going to overwrite these randomly assigned weights with the weights of a pre-trained model. So let's load the smaller pre-trained model into memory so you can copy layers from it. Here you will use the 12-layer model to upscale to our 16-layer model. First, you'll take the bottom-most 8 layers and the top-most 8 layers and concatenate them to form a total of 16 layers. You'll then overwrite the weights of the randomly initialized model with these new values. Lastly, a few lines of code copy over the components that make up the embedding and classification layers of the model, so those can be reused as well. Let's check the number of parameters to confirm that it hasn't changed.

Let's also try running inference with the model. Now this is interesting: the model has been initialized with another model's weights, so it has some ability to generate what we need. But the layers are not yet coherent, so the generation isn't good. This is why it's necessary to continue pre-training this model on more data. But as you can see, you are much further along than when you started with random weights, and this is why upscaling can help you train models much faster. During training, you'll then be updating all the weights of this model so that all of the layers work together as expected. Let's save this model and then train it.
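
As an illustration, the layer surgery might look like the sketch below for a Llama-style model in the Transformers library, where decoder layers live under model.model.layers. For brevity it grafts the layers in place rather than copying them into a freshly initialized 16-layer model as in the lesson; checkpoint names are placeholders.

```python
# Sketch of depth upscaling: duplicate layers of a 12-layer Llama-style model
# to form a 16-layer model (bottom 8 + top 8, with the middle layers shared).
from copy import deepcopy
import torch
from transformers import AutoModelForCausalLM

small = AutoModelForCausalLM.from_pretrained("your-12-layer-model")  # placeholder

bottom = deepcopy(small.model.layers[:8])       # bottom-most 8 layers
top = deepcopy(small.model.layers[-8:])         # top-most 8 layers
small.model.layers = torch.nn.ModuleList(list(bottom) + list(top))   # 16 layers
small.config.num_hidden_layers = len(small.model.layers)

# The upscaled model is incoherent until continued pre-training realigns it.
small.save_pretrained("upscaled-16-layer-model")
```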
• 17 min read
How to Train a Logistic Regression Model
What is a Logistic Regression Model?

Building classification models is one of the common tasks in NLP development. One common classification task is sentiment analysis, which involves classifying text as positive or negative. For example, you can build a system based on a classification model that automatically goes through thousands of product reviews written by users to figure out how many are positive and how many are negative. There are many models for sentiment analysis; one of them is based on the logistic regression algorithm. The logistic regression algorithm is easy to train. It also provides a good baseline result to measure performance against before trying more complex models on the same data.

#1 Feature Extraction

Raw text cannot be directly used by logistic regression or other machine learning models. You need to represent text data numerically, in other words, vectorize it using a vocabulary, or, put differently, extract features from the text data. Extracting features is typically the first step in training a logistic regression model on unstructured text data. For example, if we have the vocabulary ["I", "am", "happy", "sad"], the features will be the individual words "I", "am", "happy", and "sad" from that vocabulary, and feature extraction will be converting words from analyzed sentences into numbers. That's why the vector representation of the sentence "I am happy" will be [1, 1, 1, 0], where "1" means that the word from the vocabulary exists in the sentence and "0" means it doesn't. The binary vector method (1/0) is mostly used to explain what a vocabulary is, what features are, and how to convert text into a vector (so machine learning models can process it); it is of little use in real-life scenarios.

Create a Vocabulary for this NLP Task

A vector is commonly represented as a list of digits enclosed in brackets. Before representing a text as a vector, you have to build a vocabulary, that is, the list of unique words from all your raw reviews, by going through all the words from all these texts and saving every new word to the vocabulary while skipping repeated words.

Extract Features

The process of extracting features is based on the previously created vocabulary. We check whether every word from the vocabulary appears in the text (in our case, the review). If it does, that word, or feature, gets the value "1"; if it doesn't, it gets "0". A small sketch of this follows below.

Sparse Data Problem

The issue of sparse representation of data refers to situations where vectors contain a high proportion of features equal to zero. It means that our vocabulary is very large because we have a lot of text with different words, but each separate text we want to classify contains only a few words from the vocabulary that can be represented as "1". All other words from the vocabulary aren't present, but we still must represent them as "0" in the vector for this text. The larger the vocabulary, the more time is necessary to train your model, and the longer it will take to make predictions. This is because the model must compare more and more words from each analyzed text to all the words in the vocabulary (to understand how each word's weight contributes to the classification of the whole text as positive or negative). Even if words from the vocabulary are not present in the analyzed text, they still have weights and must be represented in the feature vectors as zeros.
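
A small self-contained sketch of the vocabulary and binary vectors described above:

```python
# Sketch: build a vocabulary from a corpus, then represent a text as a binary
# vector marking each vocabulary word as present (1) or absent (0).
def build_vocabulary(texts: list[str]) -> list[str]:
    vocab: list[str] = []
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab.append(word)       # keep only unique words
    return vocab

def to_binary_vector(text: str, vocab: list[str]) -> list[int]:
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocab]

vocab = build_vocabulary(["I am happy", "I am sad"])
print(vocab)                                   # ['i', 'am', 'happy', 'sad']
print(to_binary_vector("I am happy", vocab))   # [1, 1, 1, 0]
```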
Frequency-based Feature Selection Method for Classification

In machine learning within NLP, feature selection means choosing which features (numerical inputs) you will use in your model. The word "selection" refers to deciding which features to include. With full vocabulary vectors, each word is a feature (potentially thousands of features), which brings on the sparse data problem. To avoid that, you can select a better option and use only two features: the total positive frequency and the total negative frequency. This is called frequency-based feature selection because you select not words but word frequency sums as your features instead.

Note: This article is not the finalized version and will be completed soon.
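
And a sketch of the frequency-based alternative, computing the two frequency features from a toy labeled corpus:

```python
# Sketch: reduce each text to two features, the summed positive and negative
# word frequencies learned from labeled training data.
from collections import Counter

pos_freq: Counter = Counter()
neg_freq: Counter = Counter()
for text, label in [("happy happy day", 1), ("sad bad day", 0)]:
    (pos_freq if label == 1 else neg_freq).update(text.lower().split())

def extract_features(text: str) -> list[int]:
    words = text.lower().split()
    return [sum(pos_freq[w] for w in words),   # total positive frequency
            sum(neg_freq[w] for w in words)]   # total negative frequency

print(extract_features("happy day"))  # [3, 1] -> leans positive
```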
Dzmitry Garbar • 3 min read

Our Clients' Feedback

zensai
technicolor
crismon
berkeley
hathway
howcast
fraunhofer
apollomatrix
key2know
regenmed
moblers
showcast
ticken
Let's Talk Business
Do you have a software development project to implement? We have people to work on it. We will be glad to answer all your questions as well as estimate any project of yours. Use the form below to describe the project and we will get in touch with you within 1 business day.
Contact form
We will process your personal data as described in the privacy notice
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply
Call us

USA +1 (917) 410-57-57

UK +44 (20) 3318-18-53

Email us

[email protected]
