Stage M
Context and Motivations: Text generation using Large Language Models (LLMs) has become a powerful tool in natural language processing, leveraging vast datasets to produce human-like text across various domains. LLMs like GPT-3, LLaMA, and BLOOM have been trained on enormous corpora of text data, enabling them to generate coherent and contextually relevant content[1]. Some of the most popular datasets used for training these models include Common Crawl, a massive web-crawled dataset used in training GPT-3 and LLaMA, and RefinedWeb, a high-quality subset of Common Crawl containing over 5 trillion tokens[1].
The Pile, an 800 GB diverse corpus from 22 sources, has been instrumental in enhancing the generalization capabilities of models like GPT-Neo and OPT[1]. For more specialized applications, datasets like Starcoder Data, which contains 783 GB of code in 86 programming languages, have been used to train models focused on program synthesis[1]. Additionally, multilingual datasets such as ROOTS, a 1.6 TB corpus covering 59 languages, have been crucial in developing models like BLOOM that can generate text in multiple languages[1]. These diverse datasets enable LLMs to perform a wide range of text generation tasks, from creative writing to code generation, demonstrating the versatility and power of these AI systems in producing high-quality textual content.
In this project, our goal is to develop AI-driven content generation for event promotion using LLMs, changing the way event organizers promote their events within the Déclic app. The models should automatically craft engaging and creative event descriptions that capture the essence of each occasion, saving organizers time and resources. By drawing on datasets of successful event promotions, current trends, and user preferences, LLMs can generate compelling content that resonates with target audiences.
This automated approach not only ensures consistency in messaging across all event listings but also allows for rapid scaling, enabling organizers to promote multiple events simultaneously with high-quality, personalized content. Moreover, LLMs can incorporate SEO-friendly elements and emotional triggers in the descriptions, enhancing discoverability within the app and increasing the likelihood of user engagement.
We are a cutting-edge startup developing a new social network. As part of our commitment to innovation, we are offering an internship to a talented and motivated data scientist. For more information and to download the app, please visit https://declic.net/
This internship is in collaboration with the University Paris-Est Créteil (UPEC) and the research laboratory LISSI (Laboratoire Images, Signaux et Systèmes Intelligents).
Internship Location: Université Paris-Est Créteil, Laboratoire Images, Signaux et Systèmes Intelligents (LISSI), 122 rue Paul Armangot, 94400 Vitry-sur-Seine
Duration of Internship: 6 months
Profile:
– Currently pursuing a Master’s or Engineer’s degree in Data Science, Machine Learning, Computer Science, or a related field.
– Strong programming skills in languages such as Python, and experience with relevant libraries and frameworks (e.g., TensorFlow, PyTorch).
– Understanding of LLM algorithms; previous experience implementing and evaluating such models is a plus.
– Familiarity with data preprocessing, feature engineering, and model evaluation techniques.
– Ability to work independently and collaboratively in a dynamic team environment.
– Excellent problem-solving and communication skills.
How to Apply:
If you are passionate about data science and excited to work on cutting-edge LLM-based content generation, please send your CV and a cover letter to alice.othmani@u-pec.fr with the subject line "Internship Application – Data Scientist – Content Generation for Event Promotion using LLMs". Please also include any relevant projects or work samples, along with your academic transcripts (relevés de notes).
N.B. This internship can lead to a permanent R&D engineer position or a PhD scholarship.
References:
[1] https://kili-technology.com/large-language-models-llms/9-open-sourced-datasets-for-training-large-language-models
[2] https://trainingdata.pro/datasets/llm-text-generation
[3] https://github.com/kasnerz/quintd
[4] https://www.kaggle.com/competitions/llm-detect-ai-generated-text
[5] https://aclanthology.org/2023.acl-long.34
[6] https://www.projectpro.io/article/llm-datasets-for-training/1027
[7] https://github.com/Zjh-819/LLMDataHub
[8] https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms