With great scientific breakthroughs come solid engineering and open communities. The Natural Language Processing (NLP) community has benefited greatly from an open culture of sharing knowledge, data, and software. The primary objective of this workshop is to further the sharing of insights on the engineering and community aspects of creating, developing, and maintaining NLP open source software (OSS), topics we seldom discuss in scientific publications. Our secondary goal is to promote synergies between different open source projects and to encourage cross-software collaborations and comparisons.
In the early days of NLP, linguistic software was often monolithic, and the learning curve to install, use, and extend the tools was steep and frustrating. More often than not, NLP OSS developers and users interact in siloed communities within the ecosystems of their respective projects. Beyond the engineering aspects of NLP software, the open source movement has brought a community aspect that we often overlook in building impactful NLP technologies.
Top NLP Open Source Projects For Developers In 2020
We hope that the NLP-OSS workshop becomes the intellectual forum for collating open source knowledge beyond the scientific contribution, announcing new software and features, and promoting open source culture and best practices that extend beyond the conferences.
This talk covers what it means to operationalize Machine Learning (ML) models. It starts by analyzing the differences between ML in research and ML in production, between ML systems and traditional software, and common myths about ML in production. It then goes over the principles of good ML systems design and introduces an iterative framework for it, from project scoping, data management, model development, and deployment through maintenance and business analysis. It distinguishes DataOps, ML engineering, MLOps, and data science, and shows where each fits into the framework. The talk ends with a survey of the ML production ecosystem, the economics of open source, and open-core businesses.
Spencer Kelly is the author of compromise, a small natural language processing library for the browser. He is a web developer and maintainer of open-source libraries. His background is in the semantic web and Wikipedia. Today his work focuses on creating infographics. His open-source work is funded by freelance web development. He is from Toronto, Canada.
Lucy is an NLP/ML engineer at Upstage. She has participated in several open source projects, notably KoNLPy, a tool for Korean NLP, and is also interested in open data. She received her Ph.D. in Data Mining from Seoul National University in 2016, where she pursued various studies on text mining in the fields of manufacturing, political science, and multimedia. After her studies, she joined NAVER, a South Korea-based search-engine company, and worked on machine translation.
The year 2019 was an excellent year for developers, as almost all industry leaders open-sourced their machine learning toolkits. Open-sourcing not only helps users but also helps the tools themselves, as developers can contribute and add customisations that serve complex applications. The benefit is mutual, and it accelerates the democratisation of ML. Here we have compiled a few open-source NLP projects that should be exciting for developers and users alike:
AllenNLP is an open-source NLP research library built on PyTorch. AllenNLP makes designing and evaluating new deep learning models easy for almost any NLP problem, and it runs efficiently in the cloud or on a laptop. AllenNLP is built and maintained by the Allen Institute for AI, in close collaboration with researchers at the University of Washington and users that include Facebook Research, Airbnb, Amazon Alexa, and other industry leaders.
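As a sketch of what that ease of use looks like, the snippet below loads a pretrained reading-comprehension model through AllenNLP's Predictor API. The model archive URL is illustrative and tied to a particular release, so treat it as an assumption rather than a stable endpoint.

```python
# A minimal sketch of running a pretrained AllenNLP model.
# The model URL is an assumption; archives move between releases.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "bidaf-elmo-model-2020.03.19.tar.gz"  # reading-comprehension model
)
result = predictor.predict(
    question="What is AllenNLP built on?",
    passage="AllenNLP is an open-source NLP research library built on PyTorch.",
)
print(result["best_span_str"])  # expected: "PyTorch"
```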
Rasa is an open-source framework for building high-performing, resilient contextual assistants that you fully own and control. It provides the infrastructure to create assistants that understand messages and hold meaningful conversations, employ machine learning to improve those conversations, and integrate seamlessly with existing systems and channels.
Sure, you can do all of these things with your laptop or EKS or GKE or whatever, but what if you just want to putz around with a few containers? Then you can just go to Play with Docker and do the things. The five-instance, four-hour limit means you cannot run your new startup there to do security, AI, or analytics (all new startups do those things now), but Play with Docker is a good place to try something out before you fully commit. And because maybe you do not want to expose yourself in public (always a bad idea), you can install an internal version of Play with Docker from the open source (MIT-licensed) repository on GitHub so people in your organization can putz around too.
Alongside security, error and performance tracing are among the most frustratingly inevitable requirements for many apps. Cue a sigh of relief. Sentry offers an entire ecosystem of open source tools for monitoring the health of applications, services, and APIs, from the server-side API for collecting data, to a dashboard for making it manageable, to a comprehensive slew of application-side integrations.
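To give a feel for how little wiring those application-side integrations need, here is a minimal sketch using Sentry's Python SDK. The DSN is a placeholder you would replace with your project's own, and the sample rate is an arbitrary choice.

```python
# A minimal sketch of wiring the Sentry Python SDK into an app.
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    traces_sample_rate=0.2,  # sample 20% of transactions for performance tracing
)

# Unhandled exceptions are reported automatically; handled ones
# can be forwarded explicitly.
try:
    1 / 0
except ZeroDivisionError as exc:
    sentry_sdk.capture_exception(exc)
```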
Appsmith is a low-code framework that helps back-end developers customize software like admin panels, forms, and dashboards with minimal HTML and CSS coding. The platform jumpstarts projects with pre-built UI components and reusable templates, integrates with a broad range of APIs, data sources, and cloud services, and supports both cloud and self-hosting deployment options. Appsmith boasts more than 10 million downloads on Docker, more than 21 thousand stars on GitHub, and recently announced $41 million in Series B funding. Example use cases include customer support tools and internal processes such as communications.
Built by Traceable on Apache Kafka, Hypertrace is an open-source, distributed tracing and observability engine capable of ingesting and processing huge volumes of real-time performance data from large numbers of services across sprawling cloud-native architectures. Hypertrace monitors your applications and microservices, tracing distributed transactions across their multiple touchpoints, and distills all of this information into service metrics and application flow maps, which it displays in fully customizable dashboards.
The Text Analysis API streamlines the data mining process for developers and businesses so that they can quickly classify data from a variety of sources. It lets developers perform tasks such as summarization, language detection, sentiment analysis, article extraction, named entity recognition, and text extraction from documents, images, and audio files.
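As a rough illustration of how such a REST-style API is typically called, the sketch below posts text to a sentiment endpoint with `requests`. The URL, authentication header, and response fields are hypothetical stand-ins, not the actual Text Analysis API schema.

```python
# A hypothetical sketch of calling a text-analysis REST endpoint.
# The endpoint, auth scheme, and response shape are illustrative only.
import requests

API_URL = "https://api.example.com/v1/sentiment"  # hypothetical endpoint

response = requests.post(
    API_URL,
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    json={"text": "The new release is a huge improvement."},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # e.g. {"sentiment": "positive", "score": 0.94}
```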
Dan Barker
Natural language processing (NLP), the technology that powers all the chatbots, voice assistants, predictive text, and other speech/text applications that permeate our lives, has evolved significantly in the last few years. There are a wide variety of open source NLP tools out there, so I decided to survey the landscape to help you plan your next voice- or text-based application.
Judging from the job boards, the open-source fastText library and its pretrained models are popular with many companies. It employs a word-embedding method and is an extension of the word2vec model. Although deep neural network methods for NLP are now popular, they can be slow to train and test. fastText addresses this by employing a hierarchical classifier instead of a flat one, which can be orders of magnitude faster, especially when you have many categories.
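A minimal sketch of that workflow with the fastText Python bindings is shown below. It assumes a training file in fastText's `__label__<class> <text>` format (one example per line); `loss="hs"` selects the hierarchical softmax classifier described above.

```python
# A minimal sketch of supervised text classification with fastText.
# "train.txt" is assumed to contain lines like:
#   __label__baking How do I keep banana bread moist?
import fasttext

model = fasttext.train_supervised(input="train.txt", loss="hs", epoch=10)

# predict() returns a tuple of labels and their probabilities.
labels, probs = model.predict("Which baking dish is best for banana bread?")
print(labels, probs)  # e.g. ('__label__baking',) [0.87]
```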
Hugging Face is a company creating open-source libraries for powerful yet easy-to-use NLP, such as Tokenizers and Transformers. The Hugging Face Transformers library provides general-purpose architectures such as BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, and T5 for Natural Language Understanding (NLU) and Natural Language Generation (NLG). It currently includes thousands of pre-trained models in 100+ languages. These models are both easy to use and performant across many NLP tasks, and model training, evaluation, and sharing can be achieved in a few lines of code.
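For instance, the pipeline API gets a pretrained sentiment classifier running in a few lines; by default it downloads a stock English model from the Hugging Face hub.

```python
# A minimal sketch of the Transformers pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model
print(classifier("Hugging Face makes NLP remarkably accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```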
One open source project that embraces this approach is Hugging Face. I recently sat down with Julien Chaumond (co-founder and CTO), Jeff Boudier (Product and Growth), and Philipp Schmid (Technical Lead) to learn more about the origins of Hugging Face, a New York-based startup that helps developers accelerate the use of natural language processing (NLP) technologies. Hugging Face is a great story of how a group of passionate developers came together to accelerate the adoption of new and emerging technologies, improve developer agility, and give developers choice.
As the Hugging Face developers were creating their first consumer product, they decided to explore open sourcing some of its building blocks. One such building block was their coreference resolution system, a library that helps you work out which entities the pronouns in a sentence refer to.
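That library, released as neuralcoref, plugs into a spaCy pipeline. The sketch below assumes `neuralcoref` and the small English spaCy model are installed.

```python
# A minimal sketch of Hugging Face's neuralcoref on top of spaCy.
# Assumes: pip install neuralcoref, plus the en_core_web_sm model.
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)  # registers the coreference component

doc = nlp("My sister has a dog. She loves him.")
print(doc._.has_coref)       # True
print(doc._.coref_clusters)  # [My sister: [My sister, She], a dog: [a dog, him]]
```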
The Hugging Face team chose to open source this library because they wanted feedback from the community. In fact, Jeff Boudier said it was important to get as diverse a set of people as possible involved and contributing, particularly given the wide variation in how language is used. They were surprised at how well the move was received by the community, which convinced them that releasing more open source libraries was the way to go.